Sankt Augustin
24-25.08.2013
Introduction to the Hadoop Ecosystem
uweseiler

About me
Big Data Nerd
Travelpirate
Photography Enthusiast
Hadoop Trainer
MongoDB Author

About us
is a bunch of…
Big Data Nerds
Agile Ninjas
Continuous Delivery Gurus
Enterprise Java Specialists
Performance Geeks
Join us!

Agenda
• What is Big Data & Hadoop?
• Core Hadoop
• The Hadoop Ecosystem
• Use Cases
• What's next? Hadoop 2.0!

Why Big Data?
The volume of datasets is constantly growing…

Volume
• 2008: 200 PB a day
• 2009: 2.5 PB user data, 15 TB a day
• 2009: 6.5 PB user data, 50 TB a day
• 2011: ~200 PB data

Why Big Data?
Data is being generated faster and faster…

Velocity

Why Big Data?
The variety of data is increasing…

Variety
• Structured data
• Semi-structured data
• Unstructured data

The 3 V's of Big Data
Volume, Velocity, Variety

My favorite definition

Why Hadoop?
Traditional data stores are expensive to scale and by design difficult to distribute.
Scale out is the way to go!

How to scale data?
“Data” → read (r) → worker, worker, worker → write (w) → “Result”
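
The scale-out idea (split the data, let several workers read their part, write the partial results into one result) can be sketched in plain Java. This is only an illustration under assumptions: the class ScaleOutSketch, the method parallelSum, and the choice of summing numbers are hypothetical and not part of the talk; threads on one machine stand in for the workers of a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of scale-out: split the data, let workers read their
// slice in parallel, then merge the partial results into one result.
final class ScaleOutSketch {

    static long parallelSum(int[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (data.length + workers - 1) / workers; // slice size, rounded up
            List<CompletableFuture<Long>> partials = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(data.length, w * chunk);
                final int to = Math.min(data.length, from + chunk);
                // each worker only reads its own slice [from, to)
                partials.add(CompletableFuture.supplyAsync(() -> {
                    long sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += data[i];
                    }
                    return sum;
                }, pool));
            }
            long total = 0;
            for (CompletableFuture<Long> partial : partials) {
                total += partial.join(); // wait for the worker and merge its result
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(parallelSum(data, 3)); // prints 500500
    }
}
```

Even this toy version has to deal with slicing, waiting, and merging by hand, which is the kind of plumbing a framework is meant to take over.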

But…
Parallel processing is complicated!

But…
Data storage is not trivial!

What is Hadoop?
Distributed Storage and Computation Framework

What is Hadoop?
«Big Data» != Hadoop

What is Hadoop?
Hadoop != Database

What is Hadoop?
“Swiss army knife of the 21st century”
http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop

The Hadoop App Store
HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra Chukwa Intel Sync
Flume Hana HyperT Impala Mahout Nutch Oozie Sqoop
Scribe Tez Vertica Whirr ZooKee Horton Cloudera MapR EMC
IBM Talend TeraData Pivotal Informat Microsoft Pentaho Jasper
Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat

The Hadoop App Store
Functionality: less → more

Apache Hadoop
• HDFS
• MapReduce
• Hadoop Ecosystem
• Hadoop YARN

Hadoop Distributions (Apache Hadoop +)
• Test & Packaging
• Installation
• Monitoring
• Business Support

Big Data Suites (Hadoop Distributions +)
• Integrated Environment
• Visualization
• (Near-)Realtime analysis
• Modeling
• ETL & Connectors

The essentials…
Core Hadoop

Data Storage
OK, first things first!
I want to store all of my «Big Data»

Hadoop Distributed File System
• Distributed file system for redundant storage
• Designed to reliably store data on commodity hardware
• Built to expect hardware failures

Hadoop Distributed File System
Intended for
• large files
• batch inserts

HDFS Architecture
• Client: works with a file that is split into blocks (#1, #2)
• NameNode (master): keeps the block map and the journal log
• Secondary NameNode (helper): performs periodical merges
• DataNodes (slaves) in Rack 1 and Rack 2: hold the replicated blocks (block #1 is shown on three DataNodes)
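
How a file turns into blocks and replicas can be sketched in a few lines of Java. This is a simplified model, not HDFS code: the class HdfsBlockSketch and its methods are hypothetical, the 64 MB block size and the replication factor of 3 are assumed values (both are configurable in a real cluster), and the round-robin placement ignores the rack awareness of the real placement policy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a block map: block index -> DataNodes holding a replica.
final class HdfsBlockSketch {

    static final long BLOCK_SIZE = 64L * 1024 * 1024; // assumed block size: 64 MB
    static final int REPLICATION = 3;                 // assumed replication factor

    // Number of blocks needed for a file, rounding up for the last partial block.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Round-robin placement (a simplification, not the real rack-aware policy).
    static Map<Long, List<String>> blockMap(long fileSizeBytes, List<String> dataNodes) {
        Map<Long, List<String>> map = new LinkedHashMap<>();
        int replicas = Math.min(REPLICATION, dataNodes.size());
        long blocks = blockCount(fileSizeBytes);
        for (long block = 0; block < blocks; block++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add(dataNodes.get((int) ((block + r) % dataNodes.size())));
            }
            map.put(block, holders);
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> dataNodes = Arrays.asList("dn1", "dn2", "dn3", "dn4");
        long fileSize = 200L * 1024 * 1024; // a 200 MB file needs 4 blocks of 64 MB
        System.out.println(blockMap(fileSize, dataNodes));
    }
}
```

Losing any single DataNode in this model still leaves at least two replicas of every block, which is the point of building the file system to expect hardware failures.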

Data Processing
Data stored, check!
Now I want to create insights from my data!

MapReduce
• Programming model for distributed computations at a massive scale
• Execution framework for organizing and performing such computations
• Data locality is king

Typical large-data problem
• Iterate over a large number of records
• Extract something of interest from each (Map)
• Shuffle and sort intermediate results
• Aggregate intermediate results (Reduce)
• Generate final output
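
The five steps can be walked through on a single machine. The following is a minimal sketch in plain Java, assuming word counting as the problem; the class LocalMapReduceSketch and its methods are hypothetical and do not use the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine walk-through of the map, shuffle, and reduce steps.
final class LocalMapReduceSketch {

    // Map: extract something of interest from each record, here (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : record.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Shuffle and sort: group the intermediate values by key, keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: aggregate the intermediate values of one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> records) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String record : records) {          // iterate over all records
            intermediate.addAll(map(record));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue())); // final output
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not", "to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

Only map and reduce contain problem-specific logic here; iterating, shuffling, and sorting are generic, which is why a framework can own them.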

MapReduce Flow
Map (4 mappers): a 1, b 2 | c 3, c 6 | a 3, c 2 | b 7, c 8
Combine (per mapper): a 1, b 2 | c 9 | a 3, c 2 | b 7, c 8
Partition (per mapper): assigns each key to a reducer
Shuffle and Sort: a → 1, 3 | b → 2, 7 | c → 2, 8, 9
Reduce (3 reducers): a 4 | b 9 | c 19
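
The flow can be replayed in plain Java to see what the combine and partition stages contribute. This is a hedged sketch, not Hadoop code: the class CombinerFlowSketch and its methods are hypothetical, the hash-modulo partition function only mirrors the idea of a default hash partitioner, and the input values a 1 and b 2 are assumed so that the sums match the reduce output a 4, b 9, c 19.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical replay of map output -> combine -> partition -> shuffle -> reduce.
final class CombinerFlowSketch {

    static Map.Entry<String, Integer> kv(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Output of the four mappers (a 1 and b 2 are assumed values).
    static List<List<Map.Entry<String, Integer>>> slideExample() {
        return Arrays.asList(
                Arrays.asList(kv("a", 1), kv("b", 2)),
                Arrays.asList(kv("c", 3), kv("c", 6)),
                Arrays.asList(kv("a", 3), kv("c", 2)),
                Arrays.asList(kv("b", 7), kv("c", 8)));
    }

    // Combine: pre-aggregate the output of ONE mapper before it leaves the node.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    // Partition: decide which reducer receives a key (hash modulo reducer count).
    static int partition(String key, int reducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % reducers;
    }

    static Map<String, Integer> run(List<List<Map.Entry<String, Integer>>> mapOutputs, int reducers) {
        // Shuffle and sort: one sorted key -> values map per reducer.
        List<Map<String, List<Integer>>> shuffled = new ArrayList<>();
        for (int r = 0; r < reducers; r++) {
            shuffled.add(new TreeMap<>());
        }
        for (List<Map.Entry<String, Integer>> mapOutput : mapOutputs) {
            for (Map.Entry<String, Integer> pair : combine(mapOutput).entrySet()) {
                shuffled.get(partition(pair.getKey(), reducers))
                        .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        // Reduce: sum the values of each key; results of all reducers are merged for printing.
        Map<String, Integer> result = new TreeMap<>();
        for (Map<String, List<Integer>> reducerInput : shuffled) {
            for (Map.Entry<String, List<Integer>> group : reducerInput.entrySet()) {
                int sum = 0;
                for (int value : group.getValue()) {
                    sum += value;
                }
                result.put(group.getKey(), sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(slideExample(), 3)); // prints {a=4, b=9, c=19}
    }
}
```

The combiner turns c 3, c 6 into c 9 on the mapper side, so one pair instead of two crosses the network; the final sums are unchanged because addition is associative and commutative.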

Combined Hadoop Architecture
• Client: sends files and jobs to the master
• Master: NameNode and JobTracker, with the Secondary NameNode as helper
• Slaves (3 shown): each runs a DataNode holding blocks and a TaskTracker running tasks
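
Word Count Mapper in Java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (word, 1) for every whitespace-separated token of an input line.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // key is the position of the line in the input, value is the line itself
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, ONE);
        }
    }
}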
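
Word Count Reducer in Java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums all counts that were emitted for one word.
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        // typed iterator, so no cast from Object is needed
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}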

Scripting for Hadoop
Java for MapReduce?
I dunno, dude…
I'm more of a scripting guy…

Apache Pig
• High-level data flow language
• Made of two components:
  • Data processing language Pig Latin
  • Compiler to translate Pig Latin to MapReduce

Pig in the Hadoop ecosystem
• HDFS: Hadoop Distributed File System
• MapReduce: Distributed Programming Framework
• HCatalog: Metadata Management
• Pig: Scripting

Pig Latin
users = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user:chararray, url:chararray);
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
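
For readers who think in Java, each Pig Latin statement maps to a familiar collection operation. The following in-memory sketch is only an analogy under assumptions: the classes TopSitesSketch, User, and Page are hypothetical, user names are assumed to be unique (so the join can be done as a lookup), ties are broken by URL for a stable result, and nothing here runs on a cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical in-memory counterpart of the Pig Latin script.
final class TopSitesSketch {

    static final class User {
        final String name;
        final int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    static final class Page {
        final String user;
        final String url;
        Page(String user, String url) { this.user = user; this.url = url; }
    }

    static List<Map.Entry<String, Long>> topSites(List<User> users, List<Page> pages, int limit) {
        // FILTER users BY age >= 18 AND age <= 50
        Set<String> filteredUsers = users.stream()
                .filter(u -> u.age >= 18 && u.age <= 50)
                .map(u -> u.name)
                .collect(Collectors.toSet());
        // JOIN ... BY name, pages BY user; GROUP ... BY url; COUNT(...) AS clicks
        Map<String, Long> clicks = pages.stream()
                .filter(p -> filteredUsers.contains(p.user))
                .collect(Collectors.groupingBy(p -> p.url, Collectors.counting()));
        // ORDER ... BY clicks DESC (ties by URL, an addition of this sketch); LIMIT
        return clicks.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(
                new User("alice", 25), new User("bob", 17), new User("carol", 40));
        List<Page> pages = Arrays.asList(
                new Page("alice", "a.example"), new Page("carol", "a.example"),
                new Page("bob", "b.example"), new Page("alice", "c.example"));
        System.out.println(topSites(users, pages, 10)); // prints [a.example=2, c.example=1]
    }
}
```

The sketch fits in memory on one machine; the Pig script expresses the same pipeline but is compiled into MapReduce jobs that run over data in HDFS.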

Pig Execution Plan

Try that with Java…

SQL for Hadoop
OK, Pig seems quite useful…
But I'm more of a SQL person…

Apache Hive
• Data Warehousing Layer on top of Hadoop
• Allows analysis and queries using a SQL-like language

Hive in the Hadoop ecosystem
• HDFS: Hadoop Distributed File System
• MapReduce: Distributed Programming Framework
• HCatalog: Metadata Management
• Pig: Scripting
• Hive: Query

Hive Architecture
• Thrift applications, using the Hive Thrift Driver
• JDBC applications, using the Hive JDBC Driver
• ODBC applications, using the Hive ODBC Driver
• Hive Shell
• Hive: Hive Server, Hive Engine, and Metastore
• Underneath: MapReduce and HDFS

Hive Example
-- assumes users.txt and pages.txt are comma-separated, as in the Pig example
CREATE TABLE users(name STRING, age INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE pages(user STRING, url STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
-- ORDER BY gives a total order; SORT BY only sorts within each reducer
SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
ORDER BY clicks DESC
LIMIT 10;

But there's still more…
More components of the Hadoop Ecosystem

• HDFS: Data storage
• MapReduce: Data processing
• HCatalog: Metadata management
• Pig: Scripting
• Hive: SQL-like queries
• HBase: NoSQL database
• Mahout: Machine learning
• ZooKeeper: Cluster coordination
• Sqoop: Import & export of relational data
• Ambari: Cluster installation & management
• Oozie: Workflow automation
• Flume: Import & export of data flows

Bringing it all together…
Use Cases

Classical enterprise platform
• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …)
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …)
• Applications: Business Intelligence, Business Applications, Custom Applications
• Operation: Manage & Monitor
• Dev Tools: Build & Test

Big Data Platform
• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Mails, Sensor, Social Media, …)
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) and the Enterprise Hadoop Platform
• Applications: Business Intelligence, Business Applications, Custom Applications
• Operation: Manage & Monitor
• Dev Tools: Build & Test

Pattern #1: Refine data
Steps on the Big Data Platform diagram:
1. Capture all data
2. Process the data
3. Exchange using traditional systems
4. Process & visualize with traditional applications

Pattern #2: Explore data
Steps on the Big Data Platform diagram:
1. Capture all data
2. Process the data
3. Explore the data using applications with support for Hadoop

Pattern #3: Enrich data
Steps on the Big Data Platform diagram:
1. Capture all data
2. Process the data
3. Directly ingest the data

Bringing it all together…
One example…

Digital Advertising
• 6 billion ad deliveries per day
• Reports (and bills) for the advertising companies needed
• In-house C++ solution did not scale
• Adding functions was a nightmare

AdServing Architecture
• AdServer (FFM) and AdServer (AMS): TCP interface, custom Flume source, local files
• Flume HDFS sink into the Hadoop cluster (binary log format)
• Campaign database, synchronized as campaign data
• Pig and Hive, with temporary data
• Aggregated data on a NAS, used by the report engine and for direct download
• Config UI, job config XML, and job scheduler to start the jobs

What's next?
Hadoop 2.0 aka YARN

Hadoop 1.0
Built for web-scale batch apps
Figure: five boxes labeled "Single App, Batch" on top of three HDFS boxes

MapReduce is good for…
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset

MapReduce is OK for…
• Iterative jobs (e.g., graph algorithms)
  • Each iteration must read/write data to disk
  • I/O and latency cost of an iteration is high

MapReduce is not good for…
• Jobs that need shared state/coordination
  • Tasks are shared-nothing
  • Shared state requires a scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records

MapReduce limitations
• Scalability
  – Maximum cluster size: ~4,500 nodes
  – Maximum concurrent tasks: 40,000
  – Coarse synchronization in JobTracker
• Availability
  – A JobTracker failure kills all queued and running jobs
• Hard partition of resources into map & reduce slots
  – Low resource utilization
• Lacks support for alternate paradigms and services
  – Iterative applications implemented using MapReduce are 10x slower

Hadoop 2.0: Next-gen platform
Hadoop 1.0 (single use system: batch apps)
• MapReduce: cluster resource mgmt + data processing
• HDFS: redundant, reliable storage
Hadoop 2.0 (multi-purpose platform: batch, interactive, streaming, …)
• MapReduce and others: data processing
• YARN: cluster resource management
• HDFS 2.0: redundant, reliable storage

YARN: Taking Hadoop beyond batch
Applications run natively in Hadoop
• Batch: MapReduce
• Interactive: Tez
• Online: HOYA
• Streaming: Storm, …
• Graph: Giraph
• In-Memory: Spark
• Other: Search, …
YARN: Cluster resource management
HDFS 2.0: Redundant, reliable storage
Store all data in one place
Interact with data in multiple ways

A brief history of YARN
• Originally conceived & architected by the team at Yahoo!
  – Arun Murthy created the original JIRA in 2008 and is now the YARN release manager
• The team at Hortonworks has been working on YARN for 4 years
  – 90% of code from Hortonworks & Yahoo!
• YARN-based architecture running at scale at Yahoo!
  – Deployed on 35,000 nodes for 6+ months
• Going GA at the end of 2013?

YARN concepts
• Application
  – An application is a job submitted to the framework
  – Example: MapReduce job
• Container
  – Basic unit of allocation
  – Fine-grained resource allocation across multiple resources (memory, CPU, disk, network, GPU, …)
    • container_0 = 2 GB, 1 CPU
    • container_1 = 1 GB, 6 CPU
  – Replaces the fixed map/reduce slots
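
The container idea can be made concrete with a small allocation model. This is a sketch under assumptions, not the YARN API: the classes ContainerAllocationSketch and Node are hypothetical, only memory and CPU cores are tracked, and the first-fit policy is a stand-in for a real scheduler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical model: containers are carved out of a node's free memory and cores.
final class ContainerAllocationSketch {

    static final class Node {
        final String name;
        int freeMemoryMb;
        int freeCores;

        Node(String name, int memoryMb, int cores) {
            this.name = name;
            this.freeMemoryMb = memoryMb;
            this.freeCores = cores;
        }
    }

    // First-fit: take the first node that can satisfy BOTH dimensions of the request.
    static Optional<String> allocate(List<Node> nodes, int memoryMb, int cores) {
        for (Node node : nodes) {
            if (node.freeMemoryMb >= memoryMb && node.freeCores >= cores) {
                node.freeMemoryMb -= memoryMb;
                node.freeCores -= cores;
                return Optional.of(node.name);
            }
        }
        return Optional.empty(); // no node has enough free resources
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("node-1", 4096, 8)); // assumed capacity: 4 GB, 8 cores

        System.out.println(allocate(nodes, 2048, 1)); // container_0 = 2 GB, 1 CPU -> Optional[node-1]
        System.out.println(allocate(nodes, 1024, 6)); // container_1 = 1 GB, 6 CPU -> Optional[node-1]
        System.out.println(allocate(nodes, 2048, 1)); // only 1 GB and 1 core left -> Optional.empty
    }
}
```

With fixed slots, the memory-heavy and the CPU-heavy request would each occupy one identically sized slot; sizing containers per request is what lets differently shaped workloads share a node.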

YARN architecture
Split up the two major functions of the JobTracker:
cluster resource management & application life-cycle management
• ResourceManager with Scheduler
• NodeManagers (8 shown) hosting:
  – AM 1 with Container 1.1 and Container 1.2
  – AM 2 with Container 2.1, Container 2.2, and Container 2.3

YARN architecture
• Resource Manager
  – Global resource scheduler
  – Hierarchical queues
• Node Manager
  – Per-machine agent
  – Manages the life-cycle of containers
  – Container resource monitoring
• Application Master
  – Per-application
  – Manages application scheduling and task execution
  – e.g. MapReduce Application Master

YARN architecture
• ResourceManager with Scheduler
• NodeManagers (12 shown) hosting:
  – MapReduce 1: map 1.1, map 1.2, reduce 1.1
  – MapReduce 2: map 2.1, map 2.2, map 2.3, reduce 2.1, reduce 2.2
  – Tez: vertex 1, vertex 2, vertex 3, vertex 4
  – HOYA: HBase Master, Region server 1, Region server 2, Region server 3
  – Storm: nimbus 1, nimbus 2

YARN summary
1. Scale
2. New programming models & services
3. Improved cluster utilization
4. Agility
5. Beyond Java

Getting started…
One more thing…

User Groups
HUG Rhein-Ruhr (Düsseldorf)
– https://www.xing.com/net/hugrheinruhr/
HUG Rhein-Main (Frankfurt)
– https://www.xing.com/net/hugrheinmain/
– http://www.meetup.com/HUG-Rhein-Main/
Big Data Beers (Berlin)
– http://www.meetup.com/Big-Data-Beers/
HUG München
– http://www.meetup.com/Hadoop-User-Group-Munich/
HUG Karlsruhe/Stuttgart
– http://www.meetup.com/Hadoop-and-Big-Data-User-Group-in-Karlsruhe-Stuttgart/

Books about Hadoop
Hadoop: The Definitive Guide; Tom White; 3rd ed.; O'Reilly; 2012.
Hadoop in Action; Chuck Lam; Manning; 2011.
Programming Pig; Alan Gates; O'Reilly; 2011.
Hadoop Operations; Eric Sammer; O'Reilly; 2012.

Hortonworks Sandbox
http://hortonworks.com/products/hortonworsk-sandbox

Hadoop Training
• Programming with Hadoop
  • 4-day class
  • 16.09. – 19.09.2013, München
  • 28.10. – 31.10.2013, Frankfurt
  • 02.12. – 05.12.2013, Düsseldorf
• Administration of Hadoop
  • 3-day class
  • 23. – 25.09.2013, München
  • 04. – 06.11.2013, Frankfurt
  • 09. – 11.12.2013, Düsseldorf
http://www.codecentric.de/portfolio/schulungen-und-workshops

Introduction to the Hadoop Ecosystem (FrOSCon Edition)

  • 1. Sankt Augustin 24-25.08.2013 Introduction to the Hadoop Ecosystem uweseiler
  • 2. Sankt Augustin 24-25.08.2013 About me Big Data Nerd TravelpiratePhotography Enthusiast Hadoop Trainer MongoDB Author
  • 3. Sankt Augustin 24-25.08.2013 About us is a bunch of… Big Data Nerds Agile Ninjas Continuous Delivery Gurus Enterprise Java Specialists Performance Geeks Join us!
  • 4. Sankt Augustin 24-25.08.2013 Agenda • What is Big Data & Hadoop? • Core Hadoop • The Hadoop Ecosystem • Use Cases • What‘s next? Hadoop 2.0!
  • 5. Sankt Augustin 24-25.08.2013 Why Big Data? The volume of datasets is constantly growing…
  • 6. Sankt Augustin 24-25.08.2013 Volume 2008 200 PB a day 2009 2,5 PB user data 15 TB a day 2009 6,5 PB User Data 50 TB a day 2011 ~200 PB Data
  • 7. Sankt Augustin 24-25.08.2013 Why Big Data? The velocity of data generation is getting faster and faster…
  • 9. Sankt Augustin 24-25.08.2013 Why Big Data? The variety of data is increasing…
  • 10. Sankt Augustin 24-25.08.2013 Variety Structur ed data Semi- structur ed data Unstruct ured data
  • 11. Sankt Augustin 24-25.08.2013 The 3 V’s of Big Data VarietyVolume Velocity
  • 12. Sankt Augustin 24-25.08.2013 My favorite definition
  • 13. Sankt Augustin 24-25.08.2013 Why Hadoop? TraditionaldataStoresare expensive to scale and by Design difficult to Distribute Scale out is the way to go!
  • 14. Sankt Augustin 24-25.08.2013 How to scale data? “Data“ r r “Result“ w w worker workerworker w r
  • 15. Sankt Augustin 24-25.08.2013 But… Parallel processing is complicated!
  • 16. Sankt Augustin 24-25.08.2013 But… Data storage is not trivial!
  • 17. Sankt Augustin 24-25.08.2013 What is Hadoop? Distributed Storage and Computation Framework
  • 18. Sankt Augustin 24-25.08.2013 What is Hadoop? «Big Data» != Hadoop
  • 19. Sankt Augustin 24-25.08.2013 What is Hadoop? Hadoop != Database
  • 21. Sankt Augustin 24-25.08.2013 What is Hadoop? “Swiss army knife of the 21st century” http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
  • 22. Sankt Augustin 24-25.08.2013 The Hadoop App Store HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra Chukwa Intel Sync Flume Hana HyperT Impala Mahout Nutch Oozie Scoop Scribe Tez Vertica Whirr ZooKee Horton Cloudera MapR EMC IBM Talend TeraData Pivotal Informat Microsoft. Pentaho Jasper Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat
  • 23. Sankt Augustin 24-25.08.2013 Functionalityless more Apache Hadoop Hadoop Distributions Big Data Suites • HDFS • MapReduce • Hadoop Ecosystem • Hadoop YARN • Test & Packaging • Installation • Monitoring • Business Support + • Integrated Environment • Visualization • (Near-)Realtime analysis • Modeling • ETL & Connectors + The Hadoop App Store
  • 24. Sankt Augustin 24-25.08.2013 The essentials … Core Hadoop
  • 25. Sankt Augustin 24-25.08.2013 Data Storage OK, first things first! I want to store all of my <<Big Data>>
  • 27. Sankt Augustin 24-25.08.2013 Hadoop Distributed File System • Distributed file system for redundant storage • Designed to reliably store data on commodity hardware • Built to expect hardware failures
  • 28. Sankt Augustin 24-25.08.2013 Hadoop Distributed File System Intended for • large files • batch inserts
  • 29. Sankt Augustin 24-25.08.2013 HDFS Architecture NameNode Master Block Map Slave Slave Slave Rack 1 Rack 2 Journal Log DataNode DataNode DataNode File Client Secondary NameNode Helper periodical merges #1 #2 #1 #1 #1
  • 30. Sankt Augustin 24-25.08.2013 Data Processing Data stored, check! Now I want to create insights from my data!
  • 32. Sankt Augustin 24-25.08.2013 MapReduce • Programming model for distributed computations at a massive scale • Execution framework for organizing and performing such computations • Data locality is king
  • 33. Sankt Augustin 24-25.08.2013 Typical large-data problem • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results • Generate final output MapReduce
  • 34. Sankt Augustin 24-25.08.2013 MapReduce Flow Combine Combine Combine Combine a b 2 c 9 a 3 c 2 b 7 c 8 Partition Partition Partition Partition Shuffle and Sort Map Map Map Map a b 2 c 3 c 6 a 3 c 2 b 7 c 8 a 1 3 b 7 c 2 8 9 Reduce Reduce Reduce a 4 b 9 c 19
  • 35. Combined Hadoop Architecture [Diagram: Client (File, Job); Master with NameNode and JobTracker; Secondary NameNode (Helper); three Slaves, each with a DataNode (Block) and a TaskTracker (Task)]
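  • 36. Word Count Mapper in Java
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reporter;

      // Emits (word, 1) for every whitespace-separated token of each input line.
      public class WordCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }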
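  • 37. Word Count Reducer in Java
      import java.io.IOException;
      import java.util.Iterator;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;

      // Sums all counts collected for one word and emits (word, total).
      public class WordCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }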
  • 38. Scripting for Hadoop Java for MapReduce? I dunno, dude… I’m more of a scripting guy…
  • 40. Apache Pig • High-level data flow language • Made of two components: • Data processing language Pig Latin • Compiler to translate Pig Latin to MapReduce
  • 41. Pig in the Hadoop ecosystem • HDFS: Hadoop Distributed File System • MapReduce: Distributed Programming Framework • HCatalog: Metadata Management • Pig: Scripting
  • 42. Pig Latin
      users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
      pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
      filteredUsers = FILTER users BY age >= 18 AND age <= 50;
      joinResult = JOIN filteredUsers BY name, pages BY user;
      grouped = GROUP joinResult BY url;
      summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
      sorted = ORDER summed BY clicks DESC;
      top10 = LIMIT sorted 10;
      STORE top10 INTO 'top10sites';
  • 44. Try that with Java…
  • 45. SQL for Hadoop OK, Pig seems quite useful… But I’m more of a SQL person…
  • 47. Apache Hive • Data Warehousing Layer on top of Hadoop • Allows analysis and queries using a SQL-like language
  • 48. Hive in the Hadoop ecosystem • HDFS: Hadoop Distributed File System • MapReduce: Distributed Programming Framework • HCatalog: Metadata Management • Pig: Scripting • Hive: Query
  • 49. Hive Architecture [Diagram: Thrift, JDBC and ODBC applications with their Hive Thrift, JDBC and ODBC drivers; Hive Server and Hive Shell; Hive with the Hive Engine and Metastore; MapReduce and HDFS underneath]
  • 50. Hive Example
      -- Assumes comma-separated input files, as in the Pig example
      CREATE TABLE users(name STRING, age INT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
      CREATE TABLE pages(user STRING, url STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
      LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
      LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
      SELECT pages.url, count(*) AS clicks
      FROM users JOIN pages ON (users.name = pages.user)
      WHERE users.age >= 18 AND users.age <= 50
      GROUP BY pages.url
      ORDER BY clicks DESC
      LIMIT 10;
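For readers who think in Java, the query above (like the Pig script earlier) boils down to filter, join, group, count, sort and limit. The following is a rough single-machine sketch of that pipeline using Java streams. The class `TopSitesSketch`, its nested `User` and `PageView` types and the sample data are hypothetical and only stand in for the two tables. The join is simplified to a set lookup, which assumes user names are unique. Nothing here is distributed; it only illustrates what Hive and Pig compile into MapReduce jobs.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TopSitesSketch {
    /** Hypothetical in-memory stand-in for a row of the users table. */
    public static final class User {
        final String name;
        final int age;
        public User(String name, int age) { this.name = name; this.age = age; }
    }

    /** Hypothetical in-memory stand-in for a row of the pages table. */
    public static final class PageView {
        final String user;
        final String url;
        public PageView(String user, String url) { this.user = user; this.url = url; }
    }

    public static List<Map.Entry<String, Long>> topSites(
            List<User> users, List<PageView> pages, int limit) {
        // WHERE users.age >= 18 AND users.age <= 50
        Set<String> eligible = users.stream()
                .filter(u -> u.age >= 18 && u.age <= 50)
                .map(u -> u.name)
                .collect(Collectors.toSet());
        // JOIN on the user name (assumes unique names), GROUP BY url, COUNT(*)
        Map<String, Long> clicks = pages.stream()
                .filter(p -> eligible.contains(p.user))
                .collect(Collectors.groupingBy(p -> p.url, Collectors.counting()));
        // ORDER BY clicks DESC (ties broken by url for determinism), LIMIT
        return clicks.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<User> users = List.of(new User("alice", 25), new User("bob", 17),
                new User("carol", 40), new User("dave", 60));
        List<PageView> pages = List.of(
                new PageView("alice", "a.com"), new PageView("alice", "b.com"),
                new PageView("carol", "a.com"), new PageView("bob", "a.com"),
                new PageView("dave", "b.com"), new PageView("carol", "c.com"),
                new PageView("carol", "a.com"));
        topSites(users, pages, 10)
                .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```

With the sample data only alice and carol pass the age filter, so bob's and dave's page views are dropped before counting.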
  • 51. But there’s still more… More components of the Hadoop Ecosystem
  • 52. • HDFS: Data storage • MapReduce: Data processing • HCatalog: Metadata management • Pig: Scripting • Hive: SQL-like queries • HBase: NoSQL database • Mahout: Machine learning • ZooKeeper: Cluster coordination • Sqoop: Import & export of relational data • Ambari: Cluster installation & management • Oozie: Workflow automation • Flume: Import & export of data flows
  • 53. Bringing it all together… Use Cases
  • 54. Classical enterprise platform [Diagram: Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …); Data Systems: Traditional Systems (RDBMS, EDW, MPP, …); Applications: Business Intelligence, Business Applications, Custom Applications; alongside: Operation (Manage & Monitor), Dev Tools (Build & Test)]
  • 55. Big Data Platform [Diagram: Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Mails, Sensor, Social Media, …); Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) and Enterprise Hadoop Platform; Applications: Business Intelligence, Business Applications, Custom Applications; alongside: Operation (Manage & Monitor), Dev Tools (Build & Test)]
  • 56. Pattern #1: Refine data 1. Capture all data 2. Process the data 3. Exchange using traditional systems 4. Process & Visualize with traditional applications [Diagram: Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Mails, Sensor, Social Media, …); Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) and Enterprise Hadoop Platform; Applications: Business Intelligence, Business Applications, Custom Applications]
  • 57. Pattern #2: Explore data 1. Capture all data 2. Process the data 3. Explore the data using applications with support for Hadoop [Diagram: Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Mails, Sensor, Social Media, …); Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) and Enterprise Hadoop Platform; Applications: Business Intelligence, Business Applications, Custom Applications]
  • 58. Pattern #3: Enrich data 1. Capture all data 2. Process the data 3. Directly ingest the data [Diagram: Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Mails, Sensor, Social Media, …); Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) and Enterprise Hadoop Platform; Applications: Business Applications, Custom Applications]
  • 59. Bringing it all together… One example…
  • 60. Digital Advertising • 6 billion ad deliveries per day • Reports (and bills) needed for the advertising companies • In-house C++ solution did not scale • Adding functions was a nightmare
  • 61. AdServing Architecture [Diagram components: Campaign Database; AdServers (FFM, AMS) with TCP Interfaces; Custom Flume Sources; Flume HDFS Sink; Local files; Campaign Data; Hadoop Cluster (Pig, Hive); Binary Log Format; Synchronisation; Temporary data; NAS; Aggregated data; Report Engine; Direct Download; Job Scheduler; Config UI; Job Config XML; Start]
  • 62. What’s next? Hadoop 2.0 aka YARN
  • 63. Hadoop 1.0 Built for web-scale batch apps [Diagram: HDFS instances, each serving Single App / Batch workloads]
  • 64. MapReduce is good for… • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets • Analyzing an entire large dataset
  • 65. MapReduce is OK for… • Iterative jobs (e.g., graph algorithms) – Each iteration must read/write data to disk – I/O and latency cost of an iteration is high
  • 66. MapReduce is not good for… • Jobs that need shared state/coordination – Tasks are shared-nothing – Shared state requires a scalable state store • Low-latency jobs • Jobs on small datasets • Finding individual records
  • 67. MapReduce limitations • Scalability – Maximum cluster size: ~4,500 nodes – Maximum concurrent tasks: 40,000 – Coarse synchronization in JobTracker • Availability – Failure kills all queued and running jobs • Hard partition of resources into map & reduce slots – Low resource utilization • Lacks support for alternate paradigms and services – Iterative applications implemented using MapReduce are 10x slower
  • 68. Hadoop 2.0: Next-gen platform • Hadoop 1.0 (single use system; batch apps): HDFS (redundant, reliable storage), MapReduce (cluster resource mgmt + data processing) • Hadoop 2.0 (multi-purpose platform; batch, interactive, streaming, …): HDFS 2.0 (redundant, reliable storage), YARN (cluster resource management), MapReduce (data processing), Others (data processing)
  • 69. YARN: Taking Hadoop beyond batch • Store all data in one place • Interact with data in multiple ways • Applications run natively in Hadoop: Batch (MapReduce), Interactive (Tez), Online (HOYA), Streaming (Storm, …), Graph (Giraph), In-Memory (Spark), Other (Search, …) • YARN: Cluster resource management • HDFS 2.0: Redundant, reliable storage
  • 70. A brief history of YARN • Originally conceived & architected by the team at Yahoo! – Arun Murthy created the original JIRA in 2008 and now is the YARN release manager • The team at Hortonworks has been working on YARN for 4 years: – 90% of code from Hortonworks & Yahoo! • YARN based architecture running at scale at Yahoo! – Deployed on 35,000 nodes for 6+ months • Going GA at the end of 2013?
  • 71. YARN concepts • Application – Application is a job submitted to the framework – Example: MapReduce job • Container – Basic unit of allocation – Fine-grained resource allocation across multiple resources (memory, CPU, disk, network, GPU, …) • container_0 = 2 GB, 1 CPU • container_1 = 1 GB, 6 CPU – Replaces the fixed map/reduce slots
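The container idea can be illustrated with a toy allocator in plain Java. This is only a sketch and not YARN's API: the class `ContainerSketch`, its nested `Resource` and `Node` types and the node capacity of 8 GB and 8 cores are assumptions for illustration. Only memory and CPU are modelled. The two requests mirror `container_0` and `container_1` from the slide.

```java
import java.util.ArrayList;
import java.util.List;

public class ContainerSketch {
    /** A resource request or capacity: memory in MB plus CPU cores. */
    public static final class Resource {
        public final int memoryMb;
        public final int cores;
        public Resource(int memoryMb, int cores) { this.memoryMb = memoryMb; this.cores = cores; }

        /** True if the request fits into this (free) resource in every dimension. */
        boolean fits(Resource request) {
            return request.memoryMb <= memoryMb && request.cores <= cores;
        }

        Resource minus(Resource other) {
            return new Resource(memoryMb - other.memoryMb, cores - other.cores);
        }
    }

    /** Toy node: hands out containers of arbitrary shape until resources run out. */
    public static final class Node {
        private Resource free;
        private final List<Resource> containers = new ArrayList<>();

        public Node(Resource capacity) { this.free = capacity; }

        /** Allocates a container if it fits; there are no fixed map/reduce slots. */
        public boolean allocate(Resource request) {
            if (!free.fits(request)) {
                return false;
            }
            free = free.minus(request);
            containers.add(request);
            return true;
        }

        public Resource free() { return free; }
        public int containerCount() { return containers.size(); }
    }

    public static void main(String[] args) {
        Node node = new Node(new Resource(8192, 8));                 // assumed capacity
        System.out.println(node.allocate(new Resource(2048, 1)));   // container_0 = 2 GB, 1 CPU
        System.out.println(node.allocate(new Resource(1024, 6)));   // container_1 = 1 GB, 6 CPU
        System.out.println(node.allocate(new Resource(1024, 2)));   // rejected: only 1 core left
        System.out.println(node.free().memoryMb + " MB, " + node.free().cores + " core(s) free");
    }
}
```

Because each request states its own shape, a memory-heavy and a CPU-heavy container can share one machine, which a fixed split into map and reduce slots cannot express.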
  • 72. YARN architecture Split up the two major functions of the JobTracker: cluster resource management & application life-cycle management [Diagram: ResourceManager with Scheduler; NodeManagers hosting AM 1 with Containers 1.1 and 1.2, and AM 2 with Containers 2.1, 2.2 and 2.3]
  • 73. YARN architecture • Resource Manager – Global resource scheduler – Hierarchical queues • Node Manager – Per-machine agent – Manages the life-cycle of containers – Container resource monitoring • Application Master – Per-application – Manages application scheduling and task execution – e.g. MapReduce Application Master
  • 74. YARN architecture [Diagram: ResourceManager with Scheduler; NodeManagers hosting MapReduce 1 (map 1.1, map 1.2, reduce 1.1), MapReduce 2 (map 2.1, map 2.2, map 2.3, reduce 2.1, reduce 2.2), Tez (vertex 1 to vertex 4), HOYA (HBase Master, Region servers 1 to 3) and Storm (nimbus 1, nimbus 2)]
  • 75. YARN summary 1. Scale 2. New programming models & Services 3. Improved cluster utilization 4. Agility 5. Beyond Java
  • 76. Getting started… One more thing…
  • 77. User Groups HUG Rhein-Ruhr (Düsseldorf) – https://www.xing.com/net/hugrheinruhr/ HUG Rhein-Main (Frankfurt) – https://www.xing.com/net/hugrheinmain/ – http://www.meetup.com/HUG-Rhein-Main/ Big Data Beers (Berlin) – http://www.meetup.com/Big-Data-Beers/ HUG München – http://www.meetup.com/Hadoop-User-Group-Munich/ HUG Karlsruhe/Stuttgart – http://www.meetup.com/Hadoop-and-Big-Data-User-Group-in-Karlsruhe-Stuttgart/
  • 78. Books about Hadoop Hadoop: The Definitive Guide; Tom White; 3rd ed.; O’Reilly; 2012. Hadoop in Action; Chuck Lam; Manning; 2011. Programming Pig; Alan Gates; O’Reilly; 2011. Hadoop Operations; Eric Sammer; O’Reilly; 2012.
  • 79. Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox
  • 80. Hadoop Training • Programming with Hadoop • 4-day class • 16.09. – 19.09.2013, München • 28.10. – 31.10.2013, Frankfurt • 02.12. – 05.12.2013, Düsseldorf • Administration of Hadoop • 3-day class • 23. – 25.09.2013, München • 04. – 06.11.2013, Frankfurt • 09. – 11.12.2013, Düsseldorf http://www.codecentric.de/portfolio/schulungen-und-workshops