SlideShare a Scribd company logo
1
Scaling ETL
with Hadoop
Gwen Shapira
@gwenshap
gshapira@cloudera.com
Coming soon to a bookstore near you…
• Hadoop Application
Architectures
How to build end-to-end solutions
using Apache Hadoop and related
tools
@hadooparchbook
www.hadooparchitecturebook.com
ETL is…
• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target
• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
3
Hadoop Is…
• HDFS – Massive, redundant data storage
• MapReduce – Batch oriented data processing at scale
• Many many ways to process data in parallel at scale
4
The Ecosystem
• High level languages and abstractions
• File, relational and streaming data integration
• Process Orchestration and Scheduling
• Libraries for data wrangling
• Low latency query language
5
6
Why ETL with Hadoop?
Data Has Changed in the Last 30 YearsDATAGROWTH
END-USER
APPLICATIONS
THE INTERNET
MOBILE DEVICES
SOPHISTICATED
MACHINES
STRUCTURED DATA – 10%
1980 2013
UNSTRUCTURED DATA – 90%
Volume, Variety, Velocity Cause Problems
8
OLTP
Enterprise
Applications
Data
Warehouse
QueryExtract
Transform
Load
Business
Intelligence
Transform
1
1
1
Slow data transformations. Missed SLAs.
2
2
Slow queries. Frustrated business and IT.
3 Must archive. Archived data can’t provide value.
Got unstructured data?
• Traditional ETL:
• Text
• CSV
• XLS
• XML
• Hadoop:
• HTML
• XML, RSS
• JSON
• Apache Logs
• Avro, ProtoBuffs, ORC, Parquet
• Compression
• Office, OpenDocument, iWorks
• PDF, Epup, RTF
• Midi, MP3
• JPEG, Tiff
• Java Classes
• Mbox, RFC822
• Autocad
• TrueType Parser
• HFD / NetCDF
9
What is Apache Hadoop?
10
Has the Flexibility to Store and
Mine Any Type of Data
 Ask questions across structured and
unstructured data that were previously
impossible to ask or solve
 Not bound by a single schema
Excels at
Processing Complex Data
 Scale-out architecture divides workloads
across multiple nodes
 Flexible file system eliminates ETL
bottlenecks
Scales
Economically
 Can be deployed on commodity
hardware
 Open source platform guards against
vendor lock
Hadoop Distributed
File System (HDFS)
Self-Healing, High
Bandwidth Clustered
Storage
MapReduce
Distributed Computing
Framework
Apache Hadoop is an open source
platform for data storage and processing
that is…
 Distributed
 Fault tolerant
 Scalable
CORE HADOOP SYSTEM COMPONENTS
What I often see
ETL
Cluster
ELT in
DWH
ETL in
Hadoop
11
12
Best Practices
Arup Nanda taught me to ask:
1. Why is it better than the rest?
2. What happens if it is not followed?
3. When are they not applicable?
13
14
15
Extract
Let me count the ways
1. From Databases: Sqoop
2. Log Data: Flume
3. Copy data to HDFS
16
Data Loading Mistake #1
17
Hadoop is scalable.
Lets run as many Sqoop mappers
as possible, to get the data from
our DB faster!
— Famous last words
Result:
18
Lesson:
• Start with 2 mappers, add slowly
• Watch DB load and network utilization
• Use FairScheduler to limit number of mappers
19
Data Loading Mistake #2
20
Database specific connectors are
complicated and scary.
Lets just use the default JDBC
connector.
— Famous last words
Result:
21
Lesson:
1. There are connectors to:
Oracle, Netezza and Teradata
2. Download them
3. Read documentation
4. Ask questions if not clear
5. Follow installation instructions
6. Use Sqoop with connectors
22
Data Loading Mistake #3
23
Just copying files?
This sounds too simple. We
probably need some cool
whizzbang tool.
— Famous last words
Result
24
Lessons:
• Copying files is a legitimate solution
• In general, simple is good
25
26
Transform
Endless Possibilities
• Map Reduce
• Crunch / Cascading
• Spark
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java
27
Data Processing Mistake #0
28
Data Processing Mistake #1
29
This system must be ready in 12 month.
We have to convert 100 data sources and 5000
transformations to Hadoop.
Lets spend 2 days planning a schedule and budget
for the entire year and then just go and implement
it.
Prototype? Who needs that?
— Famous last words
Result
30
Lessons
• Take learning curve into account
• You don’t know what you don’t know
• Hadoop will be difficult and frustrating
for at least 3 month.
31
Data Processing Mistake #2
32
Hadoop is all about MapReduce.
So I’ll use MapReduce for all my
data processing needs.
— Famous last words
Result:
33
Lessons:
MapReduce is the assembly language of Hadoop:
Simple things are hard.
Hard things are possible.
34
Data Processing Mistake #3
35
I got 5000 tiny XMLs, and Hadoop
is great at processing
unstructured data.
So I’ll just leave the data like that
and parse the XML in every job.
— Famous last words
Result
36
Lessons
1. Consolidate small files
2. Don’t argue about #1
3. Convert files to easy-to-query formats
4. De-normalize
37
Data Processing Mistake #4
38
Partitions are for relational databases
— Famous last words
Result
39
Lessons
1. Without partitions every query is a full table scan
2. Yes, Hadoop scans fast.
3. But faster read is the one you don’t perform
4. Cheap storage allows you to store same dataset,
partitioned multiple ways.
5. Use partitions for fast data loading
40
41
Load
Technologies
• Sqoop
• Fuse-DFS
• Oracle Connectors
• Just copy files
• Query Hadoop
42
Data Loading Mistake #1
43
All of the data must end up in a
relational DWH.
— Famous last words
Result
44
Lessons:
• Use Relational:
• To maintain tool compatibility
• DWH enrichment
• Stay in Hadoop for:
• Text search
• Graph analysis
• Reduce time in pipeline
• Big data & small network
• Congested database
45
Data Loading Mistake #2
46
We used Sqoop to get data out of
Oracle. Lets use Sqoop to get it back in.
— Famous last words
Result
47
Lesson
Use Oracle direct connectors if you can afford them.
They are:
1. Faster than any alternative
2. Use Hadoop to make Oracle more efficient
3. Make *you* more efficient
48
49
Workflow Management
Tools
• Oozie
• Pentaho, Talend, ActiveBatch, AutoSys, Informatica,
UC4, Cron
50
Workflow Mistake #1
51
Workflow management is easy. I’ll just
write few scripts.
— Famous last words
52
— Josh Wills
Lesson:
Workflow management tool should enable:
• Keeping track of metadata, components and
integrations
• Scheduling and Orchestration
• Restarts and retries
• Cohesive System View
• Instrumentation, Measurement and Monitoring
• Reporting
53
Workflow Mistake #2
54
Schema? This is Hadoop. Why would
we need a schema?
— Famous last words
Result
55
Lesson
/user/…
/user/gshapira/testdata/orders
/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/partition
/data/pharmacy/fraud/orders/date=20131101
/etl/<biz unit>/<app>/<dataset>/<stage>
/etl/pharmacy/fraud/orders/validated
56
Workflow Mistake #3
57
Oozie was written for Hadoop, so the
right solution will always use Oozie
— Famous last words
Result
58
Lessons:
• Oozie has advantages
• Use the tool that works for you
59
Hue + Oozie
60
61
— Neil Gaiman
62
— Esther Dyson
63
Should DBAs learn Hadoop?
• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by DWH team
• Big Data == Business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects
• No, you don’t need to learn Java
64
Beginner Projects
• Take a class
• Download a VM
• Install 5 node Hadoop cluster in AWS
• Load data:
• Complete works of Shakespeare
• Movielens database
• Find the 10 most common words in Shakespeare
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• Actual use-case:
XML ingestion, ETL process, DWH history
65
Books
66
More Books
67
Ad

More Related Content

What's hot (20)

How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
DataWorks Summit/Hadoop Summit
 
Enterprise Grade Streaming under 2ms on Hadoop
Enterprise Grade Streaming under 2ms on HadoopEnterprise Grade Streaming under 2ms on Hadoop
Enterprise Grade Streaming under 2ms on Hadoop
DataWorks Summit/Hadoop Summit
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Ryan Bosshart
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
Rahul Jain
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
DataWorks Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
Gyula Fóra
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
Slim Baltagi
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
dave_revell
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Cloudera, Inc.
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
DataWorks Summit
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
DataWorks Summit/Hadoop Summit
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
markgrover
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
Uri Laserson
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
Spark Summit
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Lucidworks
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Cloudera, Inc.
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
Vinoth Chandar
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
DataWorks Summit/Hadoop Summit
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Ryan Bosshart
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
Rahul Jain
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
DataWorks Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
Gyula Fóra
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
Slim Baltagi
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
dave_revell
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Cloudera, Inc.
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
DataWorks Summit
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
markgrover
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
Uri Laserson
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
Spark Summit
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Lucidworks
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Cloudera, Inc.
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
Vinoth Chandar
 

Similar to Scaling ETL with Hadoop - Avoiding Failure (20)

Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
StampedeCon
 
MODULE 1: Introduction to Big Data Analytics.pptx
MODULE 1: Introduction to Big Data Analytics.pptxMODULE 1: Introduction to Big Data Analytics.pptx
MODULE 1: Introduction to Big Data Analytics.pptx
NiramayKolalle
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
TIB Academy
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
ch adnan
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
Shubham Parmar
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
Donald Miner
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
chariorienit
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
Donald Miner
 
Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers, Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers,
TIB Academy
 
HADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCE
HADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCEHADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCE
HADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCE
Harsha Siva Sai
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
Brian Enochson
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Global Business Events
 
Hadoop
HadoopHadoop
Hadoop
thisisnabin
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
Zohar Elkayam
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
sonukumar379092
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
Roorkee College of Engineering, Roorkee
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
arslanhaneef
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
Dr.Florence Dayana
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
Venneladonthireddy1
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
Umair Shafique
 
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
StampedeCon
 
MODULE 1: Introduction to Big Data Analytics.pptx
MODULE 1: Introduction to Big Data Analytics.pptxMODULE 1: Introduction to Big Data Analytics.pptx
MODULE 1: Introduction to Big Data Analytics.pptx
NiramayKolalle
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
TIB Academy
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
ch adnan
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
Donald Miner
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
Donald Miner
 
Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers, Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers,
TIB Academy
 
HADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCE
HADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCEHADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCE
HADOOP DISTRIBUTED FILE SYSTEM AND MAPREDUCE
Harsha Siva Sai
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
Brian Enochson
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Global Business Events
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
Zohar Elkayam
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
Dr.Florence Dayana
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
Venneladonthireddy1
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
Umair Shafique
 
Ad

More from Gwen (Chen) Shapira (20)

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep Dive
Gwen (Chen) Shapira
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Gwen (Chen) Shapira
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service mesh
Gwen (Chen) Shapira
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Gwen (Chen) Shapira
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebook
Gwen (Chen) Shapira
 
Kafka reliability velocity 17
Kafka reliability   velocity 17Kafka reliability   velocity 17
Kafka reliability velocity 17
Gwen (Chen) Shapira
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
Gwen (Chen) Shapira
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data Meetup
Gwen (Chen) Shapira
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
Gwen (Chen) Shapira
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
Gwen (Chen) Shapira
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
Gwen (Chen) Shapira
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
Gwen (Chen) Shapira
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Gwen (Chen) Shapira
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
Gwen (Chen) Shapira
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
Gwen (Chen) Shapira
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
Gwen (Chen) Shapira
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
Gwen (Chen) Shapira
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
Gwen (Chen) Shapira
 
Ssd collab13
Ssd   collab13Ssd   collab13
Ssd collab13
Gwen (Chen) Shapira
 
Integrated dwh 3
Integrated dwh 3Integrated dwh 3
Integrated dwh 3
Gwen (Chen) Shapira
 
Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep Dive
Gwen (Chen) Shapira
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Gwen (Chen) Shapira
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service mesh
Gwen (Chen) Shapira
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Gwen (Chen) Shapira
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebook
Gwen (Chen) Shapira
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
Gwen (Chen) Shapira
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data Meetup
Gwen (Chen) Shapira
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
Gwen (Chen) Shapira
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
Gwen (Chen) Shapira
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
Gwen (Chen) Shapira
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Gwen (Chen) Shapira
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
Gwen (Chen) Shapira
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
Gwen (Chen) Shapira
 
Ad

Recently uploaded (20)

Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Adobe Photoshop CC 2025 Crack Full Serial Key With Latest
Adobe Photoshop CC 2025 Crack Full Serial Key  With LatestAdobe Photoshop CC 2025 Crack Full Serial Key  With Latest
Adobe Photoshop CC 2025 Crack Full Serial Key With Latest
usmanhidray
 
Salesforce Aged Complex Org Revitalization Process .pdf
Salesforce Aged Complex Org Revitalization Process .pdfSalesforce Aged Complex Org Revitalization Process .pdf
Salesforce Aged Complex Org Revitalization Process .pdf
SRINIVASARAO PUSULURI
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Adobe Illustrator Crack | Free Download & Install Illustrator
Adobe Illustrator Crack | Free Download & Install IllustratorAdobe Illustrator Crack | Free Download & Install Illustrator
Adobe Illustrator Crack | Free Download & Install Illustrator
usmanhidray
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and CollaborateMeet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Maxim Salnikov
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Adobe Photoshop Lightroom CC 2025 Crack Latest Version
Adobe Photoshop Lightroom CC 2025 Crack Latest VersionAdobe Photoshop Lightroom CC 2025 Crack Latest Version
Adobe Photoshop Lightroom CC 2025 Crack Latest Version
usmanhidray
 
Xforce Keygen 64-bit AutoCAD 2025 Crack
Xforce Keygen 64-bit AutoCAD 2025  CrackXforce Keygen 64-bit AutoCAD 2025  Crack
Xforce Keygen 64-bit AutoCAD 2025 Crack
usmanhidray
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
Agentic AI Use Cases using GenAI LLM models
Agentic AI Use Cases using GenAI LLM modelsAgentic AI Use Cases using GenAI LLM models
Agentic AI Use Cases using GenAI LLM models
Manish Chopra
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Sales Deck SentinelOne Singularity Platform.pptx
Sales Deck SentinelOne Singularity Platform.pptxSales Deck SentinelOne Singularity Platform.pptx
Sales Deck SentinelOne Singularity Platform.pptx
EliandoLawnote
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Adobe Photoshop CC 2025 Crack Full Serial Key With Latest
Adobe Photoshop CC 2025 Crack Full Serial Key  With LatestAdobe Photoshop CC 2025 Crack Full Serial Key  With Latest
Adobe Photoshop CC 2025 Crack Full Serial Key With Latest
usmanhidray
 
Salesforce Aged Complex Org Revitalization Process .pdf
Salesforce Aged Complex Org Revitalization Process .pdfSalesforce Aged Complex Org Revitalization Process .pdf
Salesforce Aged Complex Org Revitalization Process .pdf
SRINIVASARAO PUSULURI
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Adobe Illustrator Crack | Free Download & Install Illustrator
Adobe Illustrator Crack | Free Download & Install IllustratorAdobe Illustrator Crack | Free Download & Install Illustrator
Adobe Illustrator Crack | Free Download & Install Illustrator
usmanhidray
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and CollaborateMeet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Maxim Salnikov
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Adobe Photoshop Lightroom CC 2025 Crack Latest Version
Adobe Photoshop Lightroom CC 2025 Crack Latest VersionAdobe Photoshop Lightroom CC 2025 Crack Latest Version
Adobe Photoshop Lightroom CC 2025 Crack Latest Version
usmanhidray
 
Xforce Keygen 64-bit AutoCAD 2025 Crack
Xforce Keygen 64-bit AutoCAD 2025  CrackXforce Keygen 64-bit AutoCAD 2025  Crack
Xforce Keygen 64-bit AutoCAD 2025 Crack
usmanhidray
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
Agentic AI Use Cases using GenAI LLM models
Agentic AI Use Cases using GenAI LLM modelsAgentic AI Use Cases using GenAI LLM models
Agentic AI Use Cases using GenAI LLM models
Manish Chopra
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Sales Deck SentinelOne Singularity Platform.pptx
Sales Deck SentinelOne Singularity Platform.pptxSales Deck SentinelOne Singularity Platform.pptx
Sales Deck SentinelOne Singularity Platform.pptx
EliandoLawnote
 

Scaling ETL with Hadoop - Avoiding Failure

  • 2. Coming soon to a bookstore near you… • Hadoop Application Architectures How to build end-to-end solutions using Apache Hadoop and related tools @hadooparchbook www.hadooparchitecturebook.com
  • 3. ETL is… • Extracting data from outside sources • Transforming it to fit operational needs • Loading it into the end target • (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load) 3
  • 4. Hadoop Is… • HDFS – Massive, redundant data storage • MapReduce – Batch oriented data processing at scale • Many many ways to process data in parallel at scale 4
  • 5. The Ecosystem • High level languages and abstractions • File, relational and streaming data integration • Process Orchestration and Scheduling • Libraries for data wrangling • Low latency query language 5
  • 6. 6 Why ETL with Hadoop?
  • 7. Data Has Changed in the Last 30 YearsDATAGROWTH END-USER APPLICATIONS THE INTERNET MOBILE DEVICES SOPHISTICATED MACHINES STRUCTURED DATA – 10% 1980 2013 UNSTRUCTURED DATA – 90%
  • 8. Volume, Variety, Velocity Cause Problems 8 OLTP Enterprise Applications Data Warehouse QueryExtract Transform Load Business Intelligence Transform 1 1 1 Slow data transformations. Missed SLAs. 2 2 Slow queries. Frustrated business and IT. 3 Must archive. Archived data can’t provide value.
  • 9. Got unstructured data? • Traditional ETL: • Text • CSV • XLS • XML • Hadoop: • HTML • XML, RSS • JSON • Apache Logs • Avro, ProtoBuffs, ORC, Parquet • Compression • Office, OpenDocument, iWorks • PDF, Epup, RTF • Midi, MP3 • JPEG, Tiff • Java Classes • Mbox, RFC822 • Autocad • TrueType Parser • HFD / NetCDF 9
  • 10. What is Apache Hadoop? 10 Has the Flexibility to Store and Mine Any Type of Data  Ask questions across structured and unstructured data that were previously impossible to ask or solve  Not bound by a single schema Excels at Processing Complex Data  Scale-out architecture divides workloads across multiple nodes  Flexible file system eliminates ETL bottlenecks Scales Economically  Can be deployed on commodity hardware  Open source platform guards against vendor lock Hadoop Distributed File System (HDFS) Self-Healing, High Bandwidth Clustered Storage MapReduce Distributed Computing Framework Apache Hadoop is an open source platform for data storage and processing that is…  Distributed  Fault tolerant  Scalable CORE HADOOP SYSTEM COMPONENTS
  • 11. What I often see ETL Cluster ELT in DWH ETL in Hadoop 11
  • 12. 12
  • 13. Best Practices Arup Nanda taught me to ask: 1. Why is it better than the rest? 2. What happens if it is not followed? 3. When are they not applicable? 13
  • 14. 14
  • 16. Let me count the ways 1. From Databases: Sqoop 2. Log Data: Flume 3. Copy data to HDFS 16
  • 17. Data Loading Mistake #1 17 Hadoop is scalable. Lets run as many Sqoop mappers as possible, to get the data from our DB faster! — Famous last words
  • 19. Lesson: • Start with 2 mappers, add slowly • Watch DB load and network utilization • Use FairScheduler to limit number of mappers 19
  • 20. Data Loading Mistake #2 20 Database specific connectors are complicated and scary. Lets just use the default JDBC connector. — Famous last words
  • 22. Lesson: 1. There are connectors to: Oracle, Netezza and Teradata 2. Download them 3. Read documentation 4. Ask questions if not clear 5. Follow installation instructions 6. Use Sqoop with connectors 22
  • 23. Data Loading Mistake #3 23 Just copying files? This sounds too simple. We probably need some cool whizzbang tool. — Famous last words
  • 25. Lessons: • Copying files is a legitimate solution • In general, simple is good 25
  • 27. Endless Possibilities • Map Reduce • Crunch / Cascading • Spark • Hive (i.e. SQL) • Pig • R • Shell scripts • Plain old Java 27
  • 29. Data Processing Mistake #1 29 This system must be ready in 12 month. We have to convert 100 data sources and 5000 transformations to Hadoop. Lets spend 2 days planning a schedule and budget for the entire year and then just go and implement it. Prototype? Who needs that? — Famous last words
  • 31. Lessons • Take learning curve into account • You don’t know what you don’t know • Hadoop will be difficult and frustrating for at least 3 month. 31
  • 32. Data Processing Mistake #2 32 Hadoop is all about MapReduce. So I’ll use MapReduce for all my data processing needs. — Famous last words
  • 34. Lessons: MapReduce is the assembly language of Hadoop: Simple things are hard. Hard things are possible. 34
  • 35. Data Processing Mistake #3 35 I got 5000 tiny XMLs, and Hadoop is great at processing unstructured data. So I’ll just leave the data like that and parse the XML in every job. — Famous last words
  • 37. Lessons 1. Consolidate small files 2. Don’t argue about #1 3. Convert files to easy-to-query formats 4. De-normalize 37
  • 38. Data Processing Mistake #4 38 Partitions are for relational databases — Famous last words
  • 40. Lessons 1. Without partitions every query is a full table scan 2. Yes, Hadoop scans fast. 3. But faster read is the one you don’t perform 4. Cheap storage allows you to store same dataset, partitioned multiple ways. 5. Use partitions for fast data loading 40
  • 42. Technologies • Sqoop • Fuse-DFS • Oracle Connectors • Just copy files • Query Hadoop 42
  • 43. Data Loading Mistake #1 43 All of the data must end up in a relational DWH. — Famous last words
  • 45. Lessons: • Use Relational: • To maintain tool compatibility • DWH enrichment • Stay in Hadoop for: • Text search • Graph analysis • Reduce time in pipeline • Big data & small network • Congested database 45
  • 46. Data Loading Mistake #2 46 We used Sqoop to get data out of Oracle. Lets use Sqoop to get it back in. — Famous last words
  • 48. Lesson Use Oracle direct connectors if you can afford them. They are: 1. Faster than any alternative 2. Use Hadoop to make Oracle more efficient 3. Make *you* more efficient 48
  • 50. Tools • Oozie • Pentaho, Talend, ActiveBatch, AutoSys, Informatica, UC4, Cron 50
  • 51. Workflow Mistake #1 51 Workflow management is easy. I’ll just write few scripts. — Famous last words
  • 53. Lesson: Workflow management tool should enable: • Keeping track of metadata, components and integrations • Scheduling and Orchestration • Restarts and retries • Cohesive System View • Instrumentation, Measurement and Monitoring • Reporting 53
  • 54. Workflow Mistake #2 54 Schema? This is Hadoop. Why would we need a schema? — Famous last words
  • 57. Workflow Mistake #3 57 Oozie was written for Hadoop, so the right solution will always use Oozie — Famous last words
  • 59. Lessons: • Oozie has advantages • Use the tool that works for you 59
  • 63. 63
  • 64. Should DBAs learn Hadoop? • Hadoop projects are more visible • 48% of Hadoop clusters are owned by DWH team • Big Data == Business pays attention to data • New skills – from coding to cluster administration • Interesting projects • No, you don’t need to learn Java 64
  • 65. Beginner Projects • Take a class • Download a VM • Install 5 node Hadoop cluster in AWS • Load data: • Complete works of Shakespeare • Movielens database • Find the 10 most common words in Shakespeare • Find the 10 most recommended movies • Run TPC-H • Cloudera Data Science Challenge • Actual use-case: XML ingestion, ETL process, DWH history 65

Editor's Notes

  • #8: The data landscape looks totally different now than it did 30 years ago when the fundamental concepts and technologies of data management were developed. Since 1980, the birth of personal computing, the internet, mobile devices and sophisticated electronic machines have led to an explosion in data volume, variety and velocity. Simply said, the data you’re managing today looks nothing like the data you were managing in 1980. Actually, structured data only represents somewhere between 10-20% of the total data volume in any given enterprise.
  • #11: Open source system implemented (mostly) In Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components, HDFS and MapReduce. Based on software originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of other solutions) of any type of data, and places no constraints on how that data is processed. Allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate into existing data infrastructure.
  • #12: ETL clusters are easy to use and manage, but are often inefficient. ELT is efficient, but spends expensive DWH cycles on low-value transformations, wastes storage on temp data and can cause missed SLAs. The next step is moving the ETL to Hadoop.
  • #15: There are many Hadoop technologies for ETL. Its easy to google and see what each one does and how to use it. The trick is to use the right technology at the right time, and avoid some common mistakes. So, I’m not going to waste time telling you how to use Hadoop. You’ll easily find out. I’ll tell you what to use it for, and how to avoid pitfalls.
  • #22: Database specific connectors make data ingest about 1000 times faster.
  • #23: Important note: Sqoop’s Connector to Oracle is NOT Oracle connector for Hadoop. It is called OraOop and is both free and open source
  • #26: If you get a bunch of files to a directory every day or hour, just copying them to Hadoop is fine. If you can do something in 1 line in shell, do it. Don’t overcomplicate. You are just starting to learn Hadoop, so make life easier on yourself.
  • #32: Give yourself a fighting chance
  • #35: These days there is hardly ever a reason to use plain map reduce. So many other good tools.
  • #37: Like walking Huanan trail: Not very stable and not very fast
  • #48: Mottainai – Japanese term for wasting resources or doing something in an inefficient manner
  • #49: Use Hadoop cores to pre-sort, pre-partition the data and turn it into Oracle data format. Then load it as “append” to Oracle. Uses no redo, very little CPU, lots of parallelism.