Hadoop User Group London: Data Wrangling on Hadoop
September 8 2016
Olivier de Garrigues, EMEA Solutions Lead
Creating radical productivity
for people who analyze data.
JEFFREY HEER
Co-Founder & CXO
VISUALIZATION
JOE HELLERSTEIN
Co-Founder & CSO
BIG DATA
SEAN KANDEL
Co-Founder & CTO
HUMAN-COMPUTER INTERACTION
3,000+ Companies 10,000+ Users
What is Data Wrangling?
QUESTION ANALYZE INSIGHT
DISCOVER STRUCTURE CLEANSE ENRICH VALIDATE PUBLISH
The Bridge Between Raw Data & Analysis
Ingestion Storage Processing
ANALYSIS & VISUALIZATION
LOB
DISCOVERY STRUCTURING CLEANING ENRICHMENT DISTILLATION
End-User Capabilities
IT
GOVERNANCE INTEGRATION AVAILABILITY SCALABILITY SECURITY
Technical Capabilities
Conventional Approaches Inhibit User Empowerment
Hand-Coding
Technical Workflow Mapping
Trifacta Approach: It’s All About The Experience
Interact Predict
Preview
Data Wrangling
for Financial Fraud
TRIFACTA
DATA WRANGLING WORKFLOW
Trifacta. Confidential & Proprietary.
Sample Scale Up
Refine
Sample
Results
Identify/Register Data
1.
Predictive Interaction
2.
Consume
Schedulers
Monitor and Adjust
3.
Schedule
Visualization & Analysis
Secure Access
Ingestion Processing Storage
ANALYSIS & CONSUMPTION
Discover Structure Clean Enrich Distill
LOB
IT
News: Topics, Time
Trades: Tickers, Date, $
eMails: Recipients, Topics
Phone Logs: Call Details, Recipients
Corporations: Company Relations, Individuals
Financial Services use case: Trader Fraud
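
As a rough illustration of the structure, clean and enrich steps named on this slide, the sketch below wrangles two of the listed sources (trades and eMails) in plain Python. It is not Trifacta's method or API. Every field name, record and matching rule here is a hypothetical assumption made for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw records; field names and values are illustrative only.
raw_trades = [
    {"ticker": " acme ", "date": "2016-09-01", "amount": "$1,200.50", "trader": "t1"},
    {"ticker": "ACME", "date": "2016-09-02", "amount": "$300.00", "trader": "t2"},
    {"ticker": "", "date": "2016-09-02", "amount": "n/a", "trader": "t1"},
]
raw_emails = [
    {"sender": "t1", "recipients": "x@corp.example; y@corp.example",
     "date": "2016-09-01", "topic": "ACME"},
    {"sender": "t2", "recipients": "z@corp.example",
     "date": "2016-08-30", "topic": "lunch"},
]


def clean_trade(row):
    """Structure and clean one trade row; return None if it cannot be repaired."""
    ticker = row["ticker"].strip().upper()
    amount_text = row["amount"].replace("$", "").replace(",", "")
    try:
        amount = float(amount_text)
    except ValueError:
        return None  # unparseable amount: drop the row
    if not ticker:
        return None  # missing ticker: drop the row
    return {
        "ticker": ticker,
        "date": date.fromisoformat(row["date"]),
        "amount": amount,
        "trader": row["trader"],
    }


def clean_email(row):
    """Split the recipient string into a list and normalize the topic."""
    recipients = [r.strip() for r in row["recipients"].split(";") if r.strip()]
    return {
        "sender": row["sender"],
        "recipients": recipients,
        "date": date.fromisoformat(row["date"]),
        "topic": row["topic"].strip().upper(),
    }


def enrich(trades, emails):
    """Attach to each trade the emails its trader sent that day about the same ticker."""
    index = defaultdict(list)
    for e in emails:
        index[(e["sender"], e["date"], e["topic"])].append(e)
    return [
        dict(t, related_emails=index.get((t["trader"], t["date"], t["ticker"]), []))
        for t in trades
    ]


trades = [t for t in map(clean_trade, raw_trades) if t is not None]
emails = [clean_email(e) for e in raw_emails]
wrangled = enrich(trades, emails)

# Distill: keep only trades with same-day, same-ticker emails for analyst review.
for_review = [t for t in wrangled if t["related_emails"]]
```

The exact-match join on trader, date and topic is deliberately naive. A real investigation would need looser matching and the other listed sources (news, phone logs, company relations).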
Data Wrangling Benefits
➔  Empower the people who know the data best
➔  Accelerate time to value
➔  Lower business risk with more accurate data
➔  Unlock innovation using a wider variety of data

