Kai Wähner
Technology Evangelist
kontakt@kai-waehner.de
LinkedIn
@KaiWaehner
www.kai-waehner.de
February 2017
Data Preprocessing vs. Data Wrangling in Machine Learning / Deep Learning Projects
© Copyright 2000-2017 TIBCO Software Inc.
A key task in building analytic models for machine learning or deep learning is
the integration and preparation of data sets from various sources such as files, databases,
big data stores, sensors or social networks. This step can take up to 50% of the whole
project.
This session compares alternative techniques to prepare data, including
extract-transform-load (ETL) batch processing, streaming analytics ingestion, and data
wrangling within visual analytics. Various options and their trade-offs are shown in live
demos using different advanced analytics technologies and open source frameworks such as
R, Python, Apache Spark, Talend or KNIME. The session also discusses how this relates to
visual analytics, and best practices for how the data scientist and business user should
work together to build good analytic models.
Key takeaways for the audience:
- Learn various options for preparing data sets to build analytic models
- Understand the pros and cons and the targeted persona for each option
- See different technologies and open source frameworks for data preparation
- Understand the relation to visual analytics and streaming analytics, and how these
concepts are actually leveraged to build the analytic model after data preparation
Comparison of Data Preprocessing vs. Data Wrangling vs. ETL vs. Streaming Ingestion
in Machine Learning / Deep Learning Projects
Key Takeaways for Data Preparation in Data Science:
• Various languages, frameworks and tools for data preparation - trade-offs included
• Data Wrangling as important add-on to data preprocessing - best within visual analytics tool
• Visual analytics and open source data science components are complementary
• Avoiding numerous components speeds up a data science project
Agenda
1) The Need for Data Preprocessing and Data Wrangling
2) Kaggle’s Titanic Dataset
3) Data Preprocessing - by the Data Scientist
4) Data Preprocessing - by the (Citizen) Data Scientist
5) Data Wrangling - by the Business Analyst or (Citizen) Data Scientist
6) ETL and DQ - by the Developer
7) Data Ingestion and Streaming Analytics - by the Developer
1) The Need for Data Preprocessing and Data Wrangling
From Insight to Action - Closed Loop for Big Data Analytics
Figure: closed loop from Insight to Action, driven by events, with the stages Monitor, Predict, Act, Decide, Model, Access, Analyze and Wrangle.
Analyst Reports 2016
Magic Quadrant for Advanced Analytics Platforms
The Forrester Wave: Enterprise Insight Platform Suites
Magic Quadrant for Data Integration Tools
Magic Quadrant for BI and Analytics
Demystify Data Science for the Business Analyst
Leverage Machine Learning without the help of a Data Scientist
User Roles
• Business User / Analyst
• Data Scientist
• Citizen Data Scientist
• Developer
AI-DRIVEN VISUAL ANALYTICS
DATA DISCOVERY DASHBOARDS
DATA SCIENCE RE-IMAGINED
PREDICTIVE MACHINE LEARNING
STREAMING ANALYTICS
REAL TIME ACTIONABLE
Data Preparation
• “The heart of data science”
• Domain knowledge is very important
• Often takes 60% to 80% of the whole analytical pipeline
• Get the best accuracy from machine learning algorithms on your datasets
• Cannot be fully automated (at least not in the beginning)
http://www.slideshare.net/odsc/feature-engineering
Data Cleaning
• Basics (select, filter, removal of duplicates, …)
• Sampling (balanced, stratified, ...)
• Data Partitioning (create training + validation + test data set, ...)
• Transformations (normalisation, standardisation, scaling, pivoting, ...)
• Binning (count-based, handling of missing values as its own group, …)
• Data Replacement (cutting, splitting, merging, ...)
• Weighting and Selection (attribute weighting, automatic optimization, ...)
• Attribute Generation (ID generation, ...)
• Imputation (replacement of missing observations by using statistical algorithms)
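A few of the data cleaning steps listed above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the approach of any tool named in this deck. The function names, the fixed bin edges (the slide mentions count-based binning, which this sketch does not implement) and the 60/20/20 split are assumptions chosen for the example.

```python
import bisect
import random
import statistics

def remove_duplicates(rows):
    """Keep the first occurrence of each row (rows must be hashable, e.g. tuples)."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def impute_mean(values):
    """Imputation: replace missing values (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def standardise(values):
    """Standardisation: centre to mean 0 and scale to unit population std deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def bin_value(value, edges, labels, missing_label="missing"):
    """Binning with fixed edges; a missing value becomes its own group."""
    if value is None:
        return missing_label
    return labels[bisect.bisect_right(edges, value)]

def partition(rows, train=0.6, validation=0.2, seed=42):
    """Data partitioning: shuffle, then split into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

For example, `bin_value(None, [18, 65], ["child", "adult", "senior"])` returns the separate group `"missing"` instead of forcing the row into one of the age bins.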
Feature Engineering
• Using domain knowledge of the data to create features that make machine learning algorithms work
• Fundamental to the application of machine learning
• Both difficult and expensive
• Part of Model Building, but also includes Data Preparation
The process of feature engineering
• Brainstorming or testing features
• Deciding what features to create
• Creating features
• Checking how the features work with your model
• Improving your features if needed
• Go back to brainstorming/creating more features until the work is done
Analytical Pipeline
1. Data Access
2. Data Preprocessing
3. Exploratory Data Analysis
4. Model Building
5. Model Validation
6. Model Execution
7. Deployment
Data Preparation in the Analytical Pipeline
1. Data Access
2. Data Preprocessing
3. Exploratory Data Analysis
4. Model Building
5. Model Validation
6. Model Execution
7. Deployment
Data Preprocessing + Data Wrangling = Success
Reference Architecture for Big Data Analytics
Figure: reference architecture diagram with the following labels.
• Data sources: Sensor Data, Transactions, Message Bus, Machine Data, Social Data
• Streaming Analytics: Stream Processing, Aggregate, Rules, Analytics, Correlate, Action
• Operational Analytics: Operations, Live UI, Live Monitoring, Continuous query processing, Alerts, Manual action, escalation
• Historical Analysis: Data Sheets, BI, Data Scientists, Cleansed Data, History, Data Discovery, Machine Learning, Big Data
• Integration: Enterprise Service Bus, SOA, Integration Bus, API, Event Server, Data Storage, Internal Data (ERP, MDM, DB, WMS)
Data preparation options placed on this architecture:
• ETL / Data Ingestion (Apache NiFi, Talend, …)
• Streaming Analytics (Apache Flink, TIBCO StreamBase, …)
• Data Wrangling (Trifacta, TIBCO Spotfire, …)
• Data Preparation (R, Python, KNIME, RapidMiner, …)
• Big Data Preparation (MapReduce, Spark, …)
2) Kaggle’s Titanic Dataset
Dataset
https://www.kaggle.com/c/titanic
Examples for quality improvement and feature engineering
• create new column (extract)
  • get title out of name (Mr., Mrs., Miss., Master., Other)
• create new column (aggregate)
  • family size = 1 + SibSp + Parch
• create new column 'CabinFirstCharacter'
  • extract the first character of the column 'cabin'
• remove duplicates in dataset
• add data to 'NA's (imputation)
  • Age: 'Average' instead of 'NA' or discretize to bins
  • Cabin: replace empty values with 'U' for Unknown
• use 'data science functions' to bring all data into a "similar shape" (e.g. scale / normalize / PCA / Box-Cox, …)
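The Titanic examples above could look roughly like this in plain Python. This is a sketch under stated assumptions, not the code shown in the live demos: the column names `Name`, `SibSp`, `Parch`, `Cabin` and `Age` follow the Kaggle data set, the function names are hypothetical, and the rows used to try it out are made up.

```python
import re
import statistics

KNOWN_TITLES = {"Mr", "Mrs", "Miss", "Master"}

def extract_title(name):
    """Extract the title from a 'Surname, Title. Given names' string."""
    match = re.search(r",\s*([^.]+)\.", name)
    title = match.group(1).strip() if match else ""
    # Everything outside the four common titles is collapsed into "Other".
    return title if title in KNOWN_TITLES else "Other"

def engineer_features(passengers):
    """Return new row dicts with imputed values and derived columns."""
    known_ages = [p["Age"] for p in passengers if p["Age"] is not None]
    mean_age = statistics.mean(known_ages)
    result = []
    for p in passengers:
        cabin = p["Cabin"] or "U"  # empty or missing cabin becomes 'U' for Unknown
        result.append({
            **p,
            "Cabin": cabin,
            "Age": mean_age if p["Age"] is None else p["Age"],  # mean imputation
            "Title": extract_title(p["Name"]),
            "FamilySize": 1 + p["SibSp"] + p["Parch"],
            "CabinFirstCharacter": cabin[0],
        })
    return result

# Made-up rows for illustration only (not taken from the real data set).
sample = [
    {"Name": "Doe, Mr. John", "SibSp": 1, "Parch": 0, "Cabin": "C85", "Age": 30.0},
    {"Name": "Roe, Dr. Alex", "SibSp": 0, "Parch": 2, "Cabin": "", "Age": None},
]
features = engineer_features(sample)
```

The mean age is computed once over all known ages before any row is changed, so the imputed value does not depend on row order.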
Overlapping!
Figure: overlapping areas of ETL, Data Wrangling, Streaming Analytics, Data Preprocessing and Big Data Preparation.
3) Data Preprocessing - by the Data Scientist
Frameworks for the Data Scientist
Figure: framework logos grouped into Programming Language, Big Data Framework and Deep Learning Framework, plus many more.
Data Preprocessing with R
• Built for the Data Scientist
• Includes data preprocessing functions (filter, extract, …)
• But also data science functions (scale, shuffle, PCA, …)
• Built for exploratory data analysis
• Focus on “low level” coding
• Not built for enterprise scale deployment
• Commercial Enterprise Scale Runtime
  • R: TIBCO Runtime for R (TERR), Microsoft R (formerly Revolution R)
R
https://github.com/EasyD/IntroToDataScience
R Example: dplyr Package
• Manipulate, clean and summarize tabular data (data frames).
• Data manipulation operations such as applying filters, selecting specific columns, sorting data, adding or deleting columns and aggregating data
• dplyr functions are very easy to learn and use
https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
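To make the operations listed above concrete, here is a rough standard-library Python analogue of a filter / select / add column / sort / aggregate chain. It is not dplyr code and does not use the dplyr API. The function name and the `Pclass` and `Fare` columns (taken from the Kaggle Titanic data set) are assumptions for illustration, and the rows are assumed to have no missing ages.

```python
from collections import defaultdict
from statistics import mean

def summarise_adults(rows, min_age=18):
    """Filter, select, derive, sort and aggregate a list of row dicts."""
    # filter: keep only rows that satisfy the condition
    adults = [r for r in rows if r["Age"] >= min_age]
    # select two columns and add a derived one
    slim = [
        {"Pclass": r["Pclass"], "Fare": r["Fare"],
         "FamilySize": 1 + r["SibSp"] + r["Parch"]}
        for r in adults
    ]
    # sort: highest fare first
    ordered = sorted(slim, key=lambda r: r["Fare"], reverse=True)
    # aggregate: mean fare per passenger class
    fares = defaultdict(list)
    for r in ordered:
        fares[r["Pclass"]].append(r["Fare"])
    mean_fare = {pclass: mean(values) for pclass, values in fares.items()}
    return ordered, mean_fare

# Made-up rows for illustration only.
rows = [
    {"Pclass": 1, "Age": 40, "Fare": 80.0, "SibSp": 1, "Parch": 0},
    {"Pclass": 3, "Age": 10, "Fare": 20.0, "SibSp": 0, "Parch": 1},
    {"Pclass": 3, "Age": 25, "Fare": 8.0, "SibSp": 0, "Parch": 0},
    {"Pclass": 1, "Age": 50, "Fare": 100.0, "SibSp": 0, "Parch": 0},
]
ordered, mean_fare = summarise_adults(rows)
```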
R Example: Caret Package
• 'Data Science related' Preprocessing (Center, scale, PCA, BoxCox, ...)
• Streamlines the model training process for complex regression and classification problems
• Generic interface in front of hundreds of existing R model implementations (with diverse APIs)
http://topepo.github.io/caret/index.html
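As an illustration of one of the 'data science related' transformations named above, the Box-Cox transform can be written in a few lines of Python. This is a sketch of the textbook formula only, not of how the caret package implements it. In particular, the parameter `lam` is passed in by hand here rather than estimated from the data, and the function name is hypothetical.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda.

    lam == 0 -> natural logarithm
    otherwise -> (x ** lam - 1) / lam
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

# With lam = 0.5 the transform compresses large values: 1 -> 0.0 and 4 -> 2.0.
transformed = box_cox([1.0, 4.0], 0.5)
```

Centering and scaling would usually follow such a transform so that all features end up on a comparable scale.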
Data Preprocessing with R
Live Demo
Data Preprocessing – Big Data Frameworks
• Built for the Developer and Data Scientist
• Built for processing big data (GB, TB, PB, …)
• Built-in elastic scalability
• Data processing at the edge (i.e. where the data is located)
• Commercial offerings
  • Apache Hadoop / Spark: Hortonworks, Cloudera, MapR, Databricks …
• Focus on “low level” coding
Apache Spark
https://benfradet.github.io/blog/2015/12/16/Exploring-spark.ml-with-the-Titanic-Kaggle-competition
4) Data Preprocessing - by the (Citizen) Data Scientist
• Focus on ease-of-use and time-to-market / agility
• Development Environment + Runtime / Execution Server
• Visual “Coding”
• Code Generation
• Leverages or integrates Data Science frameworks like R or H2O.ai under the hood
• Leverages Big Data frameworks like Apache Hadoop or Spark
KNIME
https://www.linkedin.com/pulse/first-experience-knime-richard-soon
RapidMiner
https://rapidminer.com/resource/rapidminer-advanced-analytics-demonstration
RapidMiner
Filter Columns
Distance-based Outlier Detection
Easy Data Preparation:
• Many visual ML operators
• Intelligent recommendations
• Native Hadoop / Spark support
Data Preprocessing with RapidMiner
Live Demo
5) Data Wrangling - by the Business Analyst or (Citizen) Data Scientist
Data Wrangling
• Built for “everybody” - Business Analyst or (Citizen) Data Scientist
• Focus on ease-of-use and time-to-market / agility
• e.g. DataWrangler, Trifacta, TIBCO Spotfire
Trifacta Wrangler
Inline Data Wrangling within Visual Analytics Tooling
http://marketo.tibco.com/rs/221-BCQ-142/images/how-integrated-data-wrangling-fuels-analytic-creativity.pdf
“When analysts are in the middle of discovery, stopping everything
and going back to another tool is jarring. It breaks their flow. They
have to come back and pick up later. Productivity plummets and
creative energy crashes.”
• Inline-Data Wrangling during exploratory analysis of data
• All-in-one tooling; done by one single user
• AI-driven data wrangling and visualization
• e.g. TIBCO Spotfire
Inline Data Wrangling
Inline Data Wrangling = Visual Interactive Data Analysis + Data Preprocessing in a Single Tool
TIBCO Spotfire
Inline Data Wrangling with TIBCO Spotfire
Live Demo
6) ETL and DQ - by the Developer
Dataflow Pipeline – Extract, Transform, Load
https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini
• Built for the developer
• Focus on ease-of-use and enterprise deployments
• Focus on visual coding
• Focus on complex integration and data quality
• Support for big data frameworks like Apache Hadoop / Spark
Pentaho: Loading, transforming and cleaning Titanic data
http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Pentaho_Data_Integration.pdf
7) Data Ingestion and Streaming Analytics - by the Developer
Streaming Analytics - Processing Pipeline
Figure: processing pipeline with the following stages.
• Stream Ingest: APIs, Adapters / Channels, Integration, Messaging
• Stream Preprocessing: Transformation, Aggregation, Enrichment, Filtering, Normalization
• Stream Analytics & Processing: Contextual Rules, Windowing, Patterns, Analytics, Deep ML, …
• Stream Outcomes: Process Management, Analytics (Real Time), Applications & APIs, Analytics / DW Reporting, Index / Search
Data Preprocessing as a piece of the puzzle (batch or real time)
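The stream preprocessing and windowing steps named in the pipeline above can be sketched with Python generators. This is a toy illustration, not how TIBCO StreamBase or any other streaming engine works. The event layout (`sensor`, `value`), the function names and the count-based tumbling window are assumptions made for the example.

```python
def preprocess(events):
    """Filter out incomplete events and normalise the reading to a float."""
    for event in events:
        if event.get("value") is None:
            continue  # filtering: drop events without a reading
        # transformation: keep only the needed fields, with a consistent type
        yield {"sensor": event["sensor"], "value": float(event["value"])}

def tumbling_mean(events, size):
    """Aggregation: emit the mean of each full, non-overlapping window of `size` events."""
    window = []
    for event in events:
        window.append(event["value"])
        if len(window) == size:
            yield sum(window) / size
            window = []
    # An incomplete trailing window is dropped in this sketch.

# Made-up events; in a real deployment these would arrive continuously.
events = [
    {"sensor": "s1", "value": "1"},
    {"sensor": "s1", "value": None},
    {"sensor": "s1", "value": 2},
    {"sensor": "s1", "value": 3.0},
    {"sensor": "s1", "value": 10},
]
means = list(tumbling_mean(preprocess(events), 3))
```

Because both stages are generators, each event flows through filtering and transformation before the next one is read, which mirrors the event-at-a-time style of stream processing.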
Dataflow Pipeline Frameworks
Streaming Analytics Frameworks and Products (not a complete list!)
Figure: frameworks and products arranged by open source vs. closed source and by product vs. framework (e.g. Microsoft Azure Stream Analytics).
http://www.kai-waehner.de/blog/2016/11/15/streaming-analytics-comparison-open-source-frameworks-products-cloud-services/
TIBCO StreamBase: Loading, transforming and cleaning Titanic data
Data Preprocessing with TIBCO StreamBase
Live Demo
Key Takeaways for Data Preparation in Data Science:
• Various languages, frameworks and tools for data preparation - trade-offs included
• Data Wrangling as important add-on to data preprocessing - best within visual analytics tool
• Visual analytics and open source data science components are complementary
• Avoiding numerous components speeds up a data science project
Questions? Please contact me!
Kai Wähner
Technology Evangelist
kontakt@kai-waehner.de
@KaiWaehner
www.kai-waehner.de
LinkedIn
Ad

More Related Content

What's hot (20)

NOSQL vs SQL
NOSQL vs SQLNOSQL vs SQL
NOSQL vs SQL
Mohammed Fazuluddin
 
Emerging Strategies for a Proactive Library Management,
Emerging Strategies for a Proactive Library Management,Emerging Strategies for a Proactive Library Management,
Emerging Strategies for a Proactive Library Management,
Fe Angela Verzosa
 
Hot Topics in Machine Learning For Research and thesis
Hot Topics in Machine Learning For Research and thesisHot Topics in Machine Learning For Research and thesis
Hot Topics in Machine Learning For Research and thesis
WriteMyThesis
 
Presentation federated search
Presentation federated searchPresentation federated search
Presentation federated search
Dr. Shakuntala Nighot
 
Library Information System
Library Information System Library Information System
Library Information System
Booktec LibBest
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
Knoldus Inc.
 
Machine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series PredictionMachine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series Prediction
Gianluca Bontempi
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
Rahul Jain
 
HP-UX Swap and Dump Unleashed by Dusan Baljevic
HP-UX Swap and Dump Unleashed by Dusan BaljevicHP-UX Swap and Dump Unleashed by Dusan Baljevic
HP-UX Swap and Dump Unleashed by Dusan Baljevic
Circling Cycle
 
The concept of information seeking behavior by using Wilsons’ (1996) revised ...
The concept of information seeking behavior by using Wilsons’ (1996) revised ...The concept of information seeking behavior by using Wilsons’ (1996) revised ...
The concept of information seeking behavior by using Wilsons’ (1996) revised ...
Lucy Kasuke
 
NoSQL Data Architecture Patterns
NoSQL Data ArchitecturePatternsNoSQL Data ArchitecturePatterns
NoSQL Data Architecture Patterns
Maynooth University
 
Unit 3
Unit 3Unit 3
Unit 3
vishal choudhary
 
Usage of helpful sequence in cc(colon classification)
Usage of helpful sequence in cc(colon classification) Usage of helpful sequence in cc(colon classification)
Usage of helpful sequence in cc(colon classification)
Prakash Das
 
Digital Library
 Digital Library Digital Library
Digital Library
Shiv Kumar
 
Batch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing DifferenceBatch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing Difference
jeetendra mandal
 
Introduction of Machine learning and Deep Learning
Introduction of Machine learning and Deep LearningIntroduction of Machine learning and Deep Learning
Introduction of Machine learning and Deep Learning
Madhu Sanjeevi (Mady)
 
Subject Indexing & Techniques
Subject Indexing  & TechniquesSubject Indexing  & Techniques
Subject Indexing & Techniques
Dr. Utpal Das
 
Introduction to Controlled Vocabulary
Introduction to Controlled VocabularyIntroduction to Controlled Vocabulary
Introduction to Controlled Vocabulary
Rebecca Thompson
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)
9866825059
 
Emerging Strategies for a Proactive Library Management,
Emerging Strategies for a Proactive Library Management,Emerging Strategies for a Proactive Library Management,
Emerging Strategies for a Proactive Library Management,
Fe Angela Verzosa
 
Hot Topics in Machine Learning For Research and thesis
Hot Topics in Machine Learning For Research and thesisHot Topics in Machine Learning For Research and thesis
Hot Topics in Machine Learning For Research and thesis
WriteMyThesis
 
Library Information System
Library Information System Library Information System
Library Information System
Booktec LibBest
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
Knoldus Inc.
 
Machine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series PredictionMachine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series Prediction
Gianluca Bontempi
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
Rahul Jain
 
HP-UX Swap and Dump Unleashed by Dusan Baljevic
HP-UX Swap and Dump Unleashed by Dusan BaljevicHP-UX Swap and Dump Unleashed by Dusan Baljevic
HP-UX Swap and Dump Unleashed by Dusan Baljevic
Circling Cycle
 
The concept of information seeking behavior by using Wilsons’ (1996) revised ...
The concept of information seeking behavior by using Wilsons’ (1996) revised ...The concept of information seeking behavior by using Wilsons’ (1996) revised ...
The concept of information seeking behavior by using Wilsons’ (1996) revised ...
Lucy Kasuke
 
NoSQL Data Architecture Patterns
NoSQL Data ArchitecturePatternsNoSQL Data ArchitecturePatterns
NoSQL Data Architecture Patterns
Maynooth University
 
Usage of helpful sequence in cc(colon classification)
Usage of helpful sequence in cc(colon classification) Usage of helpful sequence in cc(colon classification)
Usage of helpful sequence in cc(colon classification)
Prakash Das
 
Digital Library
 Digital Library Digital Library
Digital Library
Shiv Kumar
 
Batch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing DifferenceBatch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing Difference
jeetendra mandal
 
Introduction of Machine learning and Deep Learning
Introduction of Machine learning and Deep LearningIntroduction of Machine learning and Deep Learning
Introduction of Machine learning and Deep Learning
Madhu Sanjeevi (Mady)
 
Subject Indexing & Techniques
Subject Indexing  & TechniquesSubject Indexing  & Techniques
Subject Indexing & Techniques
Dr. Utpal Das
 
Introduction to Controlled Vocabulary
Introduction to Controlled VocabularyIntroduction to Controlled Vocabulary
Introduction to Controlled Vocabulary
Rebecca Thompson
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)
9866825059
 

Viewers also liked (20)

Apache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop componentsApache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop components
DataWorks Summit/Hadoop Summit
 
Role of Analytics in Consumer Packaged Goods Industry
Role of Analytics in Consumer Packaged Goods IndustryRole of Analytics in Consumer Packaged Goods Industry
Role of Analytics in Consumer Packaged Goods Industry
Perceptive Analytics
 
How PepsiCo's Big Data Strategy is Disrupting CPG Retail Analytics
How PepsiCo's Big Data Strategy is Disrupting CPG Retail AnalyticsHow PepsiCo's Big Data Strategy is Disrupting CPG Retail Analytics
How PepsiCo's Big Data Strategy is Disrupting CPG Retail Analytics
Hortonworks
 
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & TrifactaExtend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
DataWorks Summit/Hadoop Summit
 
Préparation de Données Hadoop avec Trifacta
Préparation de Données Hadoop avec TrifactaPréparation de Données Hadoop avec Trifacta
Préparation de Données Hadoop avec Trifacta
Victor Coustenoble
 
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
huguk
 
Computational Approaches to Systems Biology
Computational Approaches to Systems BiologyComputational Approaches to Systems Biology
Computational Approaches to Systems Biology
Mike Hucka
 
Systems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems levelSystems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems level
Lars Juhl Jensen
 
Analytics meets Big Data – R/Python auf der Hadoop/Spark-Plattform
Analytics meets Big Data – R/Python auf der Hadoop/Spark-PlattformAnalytics meets Big Data – R/Python auf der Hadoop/Spark-Plattform
Analytics meets Big Data – R/Python auf der Hadoop/Spark-Plattform
Rising Media Ltd.
 
The Computer Scientist and the Cleaner v4
The Computer Scientist and the Cleaner v4The Computer Scientist and the Cleaner v4
The Computer Scientist and the Cleaner v4
turingfan
 
Apps for Science - Elsevier Developer Network Workshop 201102
Apps for Science - Elsevier Developer Network Workshop 201102Apps for Science - Elsevier Developer Network Workshop 201102
Apps for Science - Elsevier Developer Network Workshop 201102
remko caprio
 
COMPUTATIONAL BIOLOGY
COMPUTATIONAL BIOLOGYCOMPUTATIONAL BIOLOGY
COMPUTATIONAL BIOLOGY
Krupali Gandhi
 
MongoDB - Big Data mit Open Source
MongoDB - Big Data mit Open SourceMongoDB - Big Data mit Open Source
MongoDB - Big Data mit Open Source
B1 Systems GmbH
 
Computational Systems Biology (JCSB)
Computational Systems Biology (JCSB)Computational Systems Biology (JCSB)
Computational Systems Biology (JCSB)
Annex Publishers
 
LSESU a Taste of R Language Workshop
LSESU a Taste of R Language WorkshopLSESU a Taste of R Language Workshop
LSESU a Taste of R Language Workshop
Korkrid Akepanidtaworn
 
Job ppt1
Job ppt1Job ppt1
Job ppt1
aumkarpraja
 
Multi-omics infrastructure and data for R/Bioconductor
Multi-omics infrastructure and data for R/BioconductorMulti-omics infrastructure and data for R/Bioconductor
Multi-omics infrastructure and data for R/Bioconductor
Levi Waldron
 
Day in the Life of a Computer Scientist
Day in the Life of a Computer ScientistDay in the Life of a Computer Scientist
Day in the Life of a Computer Scientist
Justin Brunelle
 
Data Scientist - The Sexiest Job of the 21st Century?
Data Scientist - The Sexiest Job of the 21st Century?Data Scientist - The Sexiest Job of the 21st Century?
Data Scientist - The Sexiest Job of the 21st Century?
IoT User Group Hamburg
 
Do you know what k-Means? Cluster-Analysen
Do you know what k-Means? Cluster-Analysen Do you know what k-Means? Cluster-Analysen
Do you know what k-Means? Cluster-Analysen
Harald Erb
 
Apache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop componentsApache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop components
DataWorks Summit/Hadoop Summit
 
Role of Analytics in Consumer Packaged Goods Industry
Role of Analytics in Consumer Packaged Goods IndustryRole of Analytics in Consumer Packaged Goods Industry
Role of Analytics in Consumer Packaged Goods Industry
Perceptive Analytics
 
How PepsiCo's Big Data Strategy is Disrupting CPG Retail Analytics
How PepsiCo's Big Data Strategy is Disrupting CPG Retail AnalyticsHow PepsiCo's Big Data Strategy is Disrupting CPG Retail Analytics
How PepsiCo's Big Data Strategy is Disrupting CPG Retail Analytics
Hortonworks
 
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & TrifactaExtend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
DataWorks Summit/Hadoop Summit
 
Préparation de Données Hadoop avec Trifacta
Préparation de Données Hadoop avec TrifactaPréparation de Données Hadoop avec Trifacta
Préparation de Données Hadoop avec Trifacta
Victor Coustenoble
 
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
huguk
 
Computational Approaches to Systems Biology
Computational Approaches to Systems BiologyComputational Approaches to Systems Biology
Computational Approaches to Systems Biology
Mike Hucka
 
Systems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems levelSystems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems level
Lars Juhl Jensen
 
Analytics meets Big Data – R/Python auf der Hadoop/Spark-Plattform
Analytics meets Big Data – R/Python auf der Hadoop/Spark-PlattformAnalytics meets Big Data – R/Python auf der Hadoop/Spark-Plattform
Analytics meets Big Data – R/Python auf der Hadoop/Spark-Plattform
Rising Media Ltd.
 
The Computer Scientist and the Cleaner v4
The Computer Scientist and the Cleaner v4The Computer Scientist and the Cleaner v4
The Computer Scientist and the Cleaner v4
turingfan
 
Apps for Science - Elsevier Developer Network Workshop 201102
Apps for Science - Elsevier Developer Network Workshop 201102Apps for Science - Elsevier Developer Network Workshop 201102
Apps for Science - Elsevier Developer Network Workshop 201102
remko caprio
 
MongoDB - Big Data mit Open Source
MongoDB - Big Data mit Open SourceMongoDB - Big Data mit Open Source
MongoDB - Big Data mit Open Source
B1 Systems GmbH
 
Computational Systems Biology (JCSB)
Computational Systems Biology (JCSB)Computational Systems Biology (JCSB)
Computational Systems Biology (JCSB)
Annex Publishers
 
Multi-omics infrastructure and data for R/Bioconductor
Multi-omics infrastructure and data for R/BioconductorMulti-omics infrastructure and data for R/Bioconductor
Multi-omics infrastructure and data for R/Bioconductor
Levi Waldron
 
Day in the Life of a Computer Scientist
Day in the Life of a Computer ScientistDay in the Life of a Computer Scientist
Day in the Life of a Computer Scientist
Justin Brunelle
 
Data Scientist - The Sexiest Job of the 21st Century?
Data Scientist - The Sexiest Job of the 21st Century?Data Scientist - The Sexiest Job of the 21st Century?
Data Scientist - The Sexiest Job of the 21st Century?
IoT User Group Hamburg
 
Do you know what k-Means? Cluster-Analysen
Do you know what k-Means? Cluster-Analysen Do you know what k-Means? Cluster-Analysen
Do you know what k-Means? Cluster-Analysen
Harald Erb
 
Ad

Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning

  • 1. Data Preprocessing vs. Data Wrangling in Machine Learning / Deep Learning Projects. Kai Wähner, Technology Evangelist, kontakt@kai-waehner.de, @KaiWaehner, www.kai-waehner.de. February 2017.
  • 2. © Copyright 2000-2017 TIBCO Software Inc. Comparison of Data Preprocessing vs. Data Wrangling vs. ETL vs. Streaming Ingestion in Machine Learning / Deep Learning Projects
      A key task in creating appropriate analytic models in machine learning or deep learning is the integration and preparation of data sets from various sources such as files, databases, big data stores, sensors or social networks. This step can take up to 50% of the whole project.
      This session compares alternative techniques to prepare data, including extract-transform-load (ETL) batch processing, streaming analytics ingestion, and data wrangling within visual analytics. Various options and their trade-offs are shown in live demos using different advanced analytics technologies and open source frameworks such as R, Python, Apache Spark, Talend or KNIME. The session also discusses how this relates to visual analytics, and best practices for how the data scientist and business user should work together to build good analytic models.
      Key takeaways for the audience:
      • Learn various options for preparing data sets to build analytic models
      • Understand the pros and cons and the targeted persona for each option
      • See different technologies and open source frameworks for data preparation
      • Understand the relation to visual analytics and streaming analytics, and how these concepts are leveraged to build the analytic model after data preparation
  • 3. Key Takeaways for Data Preparation in Data Science:
      • Various languages, frameworks and tools for data preparation - trade-offs included
      • Data Wrangling as important add-on to data preprocessing - best within visual analytics tool
      • Visual analytics and open source data science components are complementary
      • Avoiding numerous components speeds up a data science project
  • 4. Agenda
      1) The Need for Data Preprocessing and Data Wrangling
      2) Kaggle’s Titanic Dataset
      3) Data Preprocessing - by the Data Scientist
      4) Data Preprocessing - by the (Citizen) Data Scientist
      5) Data Wrangling - by the Business Analyst or (Citizen) Data Scientist
      6) ETL and DQ - by the Developer
      7) Data Ingestion and Streaming Analytics - by the Developer
  • 5. Agenda: 1) The Need for Data Preprocessing and Data Wrangling
  • 6. From Insight to Action - Closed Loop for Big Data Analytics [Diagram: Events, Insight, Action]
  • 7. From Insight to Action - Closed Loop for Big Data Analytics [Diagram: Insight, Action; Monitor, Predict, Act, Decide, Model, Access, Analyze, Wrangle]
  • 8. Analyst Reports 2016
      • Magic Quadrant for Advanced Analytics Platforms
      • The Forrester Wave: Enterprise Insight Platform Suites
      • Magic Quadrant for Data Integration Tools
      • Magic Quadrant for BI and Analytics
  • 9. Demystify Data Science for the Business Analyst: Leverage Machine Learning without the help of a Data Scientist
  • 10. User Roles
      • Business User / Analyst
      • Data Scientist
      • Citizen Data Scientist
      • Developer
      [Graphic labels: AI-driven visual analytics, data discovery, dashboards; data science re-imagined, predictive, machine learning; streaming analytics, real time, actionable]
  • 11. Data Preparation
      • “The heart of data science”
      • Domain knowledge is very important
      • Often takes 60% to 80% of the whole analytical pipeline
      • Get the best accuracy from machine learning algorithms on your datasets
      • Cannot be fully automated (at least not in the beginning)
      http://www.slideshare.net/odsc/feature-engineering
  • 12. Data Cleaning
      • Basics (select, filter, removal of duplicates, …)
      • Sampling (balanced, stratified, ...)
      • Data Partitioning (create training + validation + test data set, ...)
      • Transformations (normalisation, standardisation, scaling, pivoting, ...)
      • Binning (count-based, handling of missing values as its own group, …)
      • Data Replacement (cutting, splitting, merging, ...)
      • Weighting and Selection (attribute weighting, automatic optimization, ...)
      • Attribute Generation (ID generation, ...)
      • Imputation (replacement of missing observations by using statistical algorithms)
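Two of the cleaning steps listed on this slide, binning with missing values as their own group and partitioning into training, validation and test sets, can be sketched in plain Python. This is a minimal illustration and not code from the talk: the function names, the fixed bin edges (the slide mentions count-based binning; fixed edges are used here for brevity) and the 60/20/20 split are all assumptions.

```python
import random

def bin_age(age, edges=(18, 40, 65)):
    """Map an age to a labelled bin; missing values form their own group."""
    if age is None:
        return "missing"
    lower = 0
    for upper in edges:
        if age < upper:
            return f"[{lower}, {upper})"
        lower = upper
    return f"[{lower}, inf)"

def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Hypothetical sample values, chosen only to exercise each branch.
ages = [22.0, None, 38.0, 70.0]
bins = [bin_age(a) for a in ages]
train_set, validation_set, test_set = partition(range(10))
```

Keeping "missing" as a separate bin preserves the information that a value was absent, which can itself be predictive.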
  • 13. Feature Engineering
      • Using domain knowledge of the data to create features that make machine learning algorithms work
      • Fundamental to the application of machine learning
      • Both difficult and expensive
      • Part of Model Building, but also includes Data Preparation
      The process of feature engineering:
      • Brainstorming or testing features
      • Deciding what features to create
      • Creating features
      • Checking how the features work with your model
      • Improving your features if needed
      • Go back to brainstorming/creating more features until the work is done
  • 14. Analytical Pipeline
      1. Data Access
      2. Data Preprocessing
      3. Exploratory Data Analysis
      4. Model Building
      5. Model Validation
      6. Model Execution
      7. Deployment
  • 15. Google Trends
  • 16. Data Preparation in the Analytical Pipeline
      1. Data Access
      2. Data Preprocessing
      3. Exploratory Data Analysis
      4. Model Building
      5. Model Validation
      6. Model Execution
      7. Deployment
      Data Preprocessing + Data Wrangling = Success
  • 17. Reference Architecture for Big Data Analytics [Architecture diagram; labels: Operational Analytics, Operations, Live UI, Sensor Data, Transactions, Message Bus, Machine Data, Social Data, Streaming Analytics, Action, Aggregate, Rules, Stream Processing, Analytics, Correlate, Live Monitoring, Continuous query processing, Alerts, Manual action / escalation, Historical Analysis, Data Sheets, BI, Data Scientists, Cleansed Data, History, Data Discovery, Enterprise Service Bus, ERP, MDM, DB, WMS, SOA, Data Storage, Internal Data, Integration Bus, API, Event Server, Machine Learning, Big Data]
  • 18. Reference Architecture for Big Data Analytics [Architecture diagram, annotated with tool categories:]
      • ETL / Data Ingestion (Apache NiFi, Talend, …)
      • Streaming Analytics (Apache Flink, TIBCO StreamBase, …)
      • Data Wrangling (Trifacta, TIBCO Spotfire, …)
      • Data Preparation (R, Python, KNIME, RapidMiner, …)
      • Big Data Preparation (MapReduce, Spark, …)
  • 19. Agenda: 2) Kaggle’s Titanic Dataset
  • 20. Dataset: https://www.kaggle.com/c/titanic
  • 21. Examples for quality improvement and feature engineering
      • Create new column (extract): get title out of name (Mr., Mrs., Miss., Master., Other)
      • Create new column (aggregate): family size = 1 + SibSp + Parch
      • Create new column 'CabinFirstCharacter': extract the first character of the column 'Cabin'
      • Remove duplicates in dataset
      • Add data to 'NA's (imputation): Age: 'Average' instead of 'NA', or discretize to bins; Cabin: replace empty values with 'U' for Unknown
      • Use 'data science functions' to bring all data into a “similar shape” (e.g. scale / normalize / PCA / Box-Cox, …)
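The Titanic examples on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the code shown in the live demos: the rows below are made up (they only reuse the Kaggle column names Name, SibSp, Parch, Cabin and Age), and the helper names `extract_title` and `prepare` are hypothetical.

```python
import re
from statistics import mean

KNOWN_TITLES = {"Mr", "Mrs", "Miss", "Master"}

def extract_title(name):
    """Return the title between the comma and the first period, else 'Other'."""
    match = re.search(r",\s*([^.,]+)\.", name)
    title = match.group(1).strip() if match else ""
    return title if title in KNOWN_TITLES else "Other"

def prepare(rows):
    """Deduplicate rows, derive new columns and impute missing values."""
    # Remove exact duplicate rows while keeping the original order.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))
    # Impute missing ages with the mean of the observed ages.
    mean_age = mean(r["Age"] for r in unique if r["Age"] is not None)
    for row in unique:
        row["Title"] = extract_title(row["Name"])
        row["FamilySize"] = 1 + row["SibSp"] + row["Parch"]
        # Empty cabin values become 'U' for Unknown.
        row["CabinFirstCharacter"] = row["Cabin"][0] if row["Cabin"] else "U"
        if row["Age"] is None:
            row["Age"] = mean_age
    return unique

# Made-up passengers; not rows from the real dataset.
passengers = [
    {"Name": "Doe, Mr. John", "SibSp": 1, "Parch": 0, "Cabin": "C85", "Age": 40.0},
    {"Name": "Doe, Mrs. Jane", "SibSp": 1, "Parch": 2, "Cabin": "", "Age": 30.0},
    {"Name": "Roe, Dr. Alex", "SibSp": 0, "Parch": 0, "Cabin": "", "Age": None},
    {"Name": "Roe, Dr. Alex", "SibSp": 0, "Parch": 0, "Cabin": "", "Age": None},
]
cleaned = prepare(passengers)
```

Mean imputation is the simplest of the options the slide names; discretizing Age into bins is an alternative that avoids inventing a numeric value.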
  • 22. Overlapping! [Diagram: ETL, Data Wrangling, Streaming Analytics, Data Preprocessing, Big Data Preparation]
  • 23. Agenda: 3) Data Preprocessing - by the Data Scientist
  • 24. Frameworks for the Data Scientist [Logo categories: Programming Language, Big Data Framework, Deep Learning Framework; many more …]
  • 25. Data Preprocessing with R
      • Built for the Data Scientist
      • Includes data preprocessing functions (filter, extract, …)
      • But also data science functions (scale, shuffle, PCA, …)
      • Built for exploratory data analysis
      • Focus on “low level” coding
      • Not built for enterprise scale deployment
      • Commercial enterprise-scale runtimes for R: TIBCO Runtime for R (TERR), Microsoft R (formerly Revolution R)
  • 26. R: https://github.com/EasyD/IntroToDataScience
  • 27. R Example: dplyr Package
      • Manipulate, clean and summarize unstructured data
      • Data manipulation operations such as applying filters, selecting specific columns, sorting data, adding or deleting columns and aggregating data
      • dplyr functions are very easy to learn and use
      https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
  • 28. R Example: Caret Package
      • 'Data Science related' preprocessing (center, scale, PCA, BoxCox, ...)
      • Streamlines the model training process for complex regression and classification problems
      • Generic interface in front of hundreds of existing R model implementations (with diverse APIs)
      http://topepo.github.io/caret/index.html
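The "center, scale" preprocessing named on this slide amounts to subtracting a column's mean and dividing by its standard deviation, with both statistics learned on the training data and then reused for new data. The sketch below shows that idea in plain Python rather than R; it does not reproduce the caret API, and the function names and sample values are assumptions.

```python
from statistics import mean, stdev

def fit_center_scale(column):
    """Learn the mean and sample standard deviation from training values."""
    return mean(column), stdev(column)

def apply_center_scale(column, params):
    """Center and scale values using previously learned parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in column]

# Hypothetical numeric feature, e.g. a fare column.
train_values = [7.0, 8.0, 9.0, 10.0, 11.0]
params = fit_center_scale(train_values)
train_scaled = apply_center_scale(train_values, params)

# New data is transformed with the training parameters, not its own statistics.
new_scaled = apply_center_scale([9.0, 13.0], params)
```

Separating the fit step from the apply step keeps information from validation or test data out of the preprocessing.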
  • 29. Live Demo: Data Preprocessing with R
  • 30. Data Preprocessing – Big Data Frameworks
      • Built for the Developer and Data Scientist
      • Built for processing big data (GB, TB, PB, …)
      • Built-in elastic scalability
      • Data processing at the edge (i.e. where the data is located)
      • Commercial offerings for Apache Hadoop / Spark: Hortonworks, Cloudera, MapR, Databricks, …
      • Focus on “low level” coding
  • 31. Apache Spark: https://benfradet.github.io/blog/2015/12/16/Exploring-spark.ml-with-the-Titanic-Kaggle-competition
  • 32. Agenda: 4) Data Preprocessing - by the (Citizen) Data Scientist
  • 33. Data Preprocessing - by the (Citizen) Data Scientist
      • Focus on ease-of-use and time-to-market / agility
      • Development Environment + Runtime / Execution Server
      • Visual “Coding”
      • Code Generation
      • Leverages, or integrates with, data science frameworks like R or H2O.ai under the hood
      • Leverages big data frameworks like Apache Hadoop or Spark
  • 34. KNIME: https://www.linkedin.com/pulse/first-experience-knime-richard-soon
  • 35. RapidMiner: https://rapidminer.com/resource/rapidminer-advanced-analytics-demonstration
  • 36. RapidMiner [Screenshot labels: Filter Columns, Distance-based Outlier Detection]
      Easy Data Preparation:
      • Many visual ML operators
      • Intelligent recommendations
      • Native Hadoop / Spark support
  • 37. Live Demo: Data Preprocessing with RapidMiner
  • 38. Agenda: 5) Data Wrangling - by the Business Analyst or (Citizen) Data Scientist
  • 39. Data Wrangling
      • Built for “everybody” - Business Analyst or (Citizen) Data Scientist
      • Focus on ease-of-use and time-to-market / agility
      • e.g. DataWrangler, Trifacta, TIBCO Spotfire
      [Screenshot: Trifacta Wrangler]
  • 40. Inline Data Wrangling within Visual Analytics Tooling
      “When analysts are in the middle of discovery, stopping everything and going back to another tool is jarring. It breaks their flow. They have to come back and pick up later. Productivity plummets and creative energy crashes.”
      http://marketo.tibco.com/rs/221-BCQ-142/images/how-integrated-data-wrangling-fuels-analytic-creativity.pdf
      • Inline data wrangling during exploratory analysis of data
      • All-in-one tooling; done by one single user
      • AI-driven data wrangling and visualization
      • e.g. TIBCO Spotfire
  • 41. Inline Data Wrangling = Visual Interactive Data Analysis + Data Preprocessing in a Single Tool
  • 42. TIBCO Spotfire
  • 43. Live Demo: Inline Data Wrangling with TIBCO Spotfire
  • 44. Agenda: 6) ETL and DQ - by the Developer
  • 45. Dataflow Pipeline – Extract, Transform, Load
      • Built for the developer
      • Focus on ease-of-use and enterprise deployments
      • Focus on visual coding
      • Focus on complex integration and data quality
      • Support for big data frameworks like Apache Hadoop / Spark
      https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini
  • 46. Pentaho: Loading, transforming and cleaning Titanic data
      http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Pentaho_Data_Integration.pdf
  • 47. Agenda: 7) Data Ingestion and Streaming Analytics - by the Developer
  • 48. Streaming Analytics - Processing Pipeline
      • Stream Ingest: APIs, Adapters / Channels, Integration, Messaging
      • Stream Preprocessing: Transformation, Aggregation, Enrichment, Filtering
      • Stream Analytics & Processing: Contextual Rules, Windowing, Patterns, Analytics, Deep ML, …
      • Stream Outcomes: Process Management, Analytics (Real Time), Applications & APIs, Analytics / DW Reporting
      • Further diagram labels: Index / Search, Normalization
      Data Preprocessing as piece of the puzzle (batch or real time)
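The stream preprocessing and windowing stages named on this slide can be illustrated with plain Python generators. This is a toy sketch under stated assumptions, not how TIBCO StreamBase or any other streaming product implements them: the event layout (`sensor`, `value`), the function names and the count-based tumbling window are all hypothetical.

```python
from statistics import mean

def preprocess(events):
    """Filter out events without a reading and normalise the remaining fields."""
    for event in events:
        if event.get("value") is None:
            continue  # filtering: drop incomplete events
        yield {
            "sensor": event["sensor"].strip().lower(),  # normalisation
            "value": float(event["value"]),             # transformation
        }

def tumbling_window_mean(events, size):
    """Emit the mean value of each full, non-overlapping window of `size` events."""
    window = []
    for event in events:
        window.append(event["value"])
        if len(window) == size:
            yield mean(window)
            window = []

# Made-up sensor events, processed one at a time as they would arrive.
stream = [
    {"sensor": " A1 ", "value": "1.0"},
    {"sensor": "a1", "value": None},
    {"sensor": "A1", "value": "3.0"},
    {"sensor": "a1", "value": "5.0"},
    {"sensor": "a1", "value": "7.0"},
]
averages = list(tumbling_window_mean(preprocess(stream), size=2))
```

Because both stages are generators, each event flows through filtering and aggregation as it arrives; the same functions also work on a finite batch, which mirrors the slide's "batch or real time" remark.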
  • 49. Dataflow Pipeline Frameworks
  • 50. Streaming Analytics Frameworks and Products (no complete list!) [Logo matrix: open source vs. closed source, product vs. framework; includes Microsoft Azure Stream Analytics]
      http://www.kai-waehner.de/blog/2016/11/15/streaming-analytics-comparison-open-source-frameworks-products-cloud-services/
  • 51. TIBCO StreamBase: Loading, transforming and cleaning Titanic data
  • 52. Live Demo: Data Preprocessing with TIBCO StreamBase
  • 53. Key Takeaways for Data Preparation in Data Science:
      • Various languages, frameworks and tools for data preparation - trade-offs included
      • Data Wrangling as important add-on to data preprocessing - best within visual analytics tool
      • Visual analytics and open source data science components are complementary
      • Avoiding numerous components speeds up a data science project
  • 54. Questions? Please contact me! Kai Wähner, Technology Evangelist, kontakt@kai-waehner.de, @KaiWaehner, www.kai-waehner.de, LinkedIn