The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC - Data Con LA
Hadoop is moving towards improved security and real-time analytics. For security, Hadoop vendors have made acquisitions and implemented features like Kerberos authentication and Apache Sentry authorization. For real-time analytics, tools are focusing on real-time streaming (like Storm, Spark Streaming, and Samza) and real-time querying of data (like Hive on Tez, Impala, Drill, and Spark). The right tool depends on use cases, and enterprises should choose what is easiest and most scalable.
This was the first time I introduced the concept of Schema-on-Read vs. Schema-on-Write to the public. It was at the Berkeley EECS RAD Lab retreat Open Mic session on May 28, 2009, in Santa Cruz, California.
Big SQL Competitive Summary - Vendor Landscape - Nicolas Morales
IBM's Big SQL is their SQL for Hadoop product that allows users to run SQL queries on Hadoop data. It uses the Hive metastore to catalog table definitions and shares data logic with Hive. Big SQL is architected for high performance with a massively parallel processing (MPP) runtime and runs directly on the Hadoop cluster with no proprietary storage formats required. The document compares Big SQL to other SQL on Hadoop solutions and outlines its performance and architectural advantages.
Which Hadoop Distribution to use: Apache, Cloudera, MapR or Hortonworks? - Edureka!
This document discusses various Hadoop distributions and how to choose between them. It introduces Apache Hadoop and describes popular distributions from Cloudera, Hortonworks, and MapR. Cloudera is based on open source Hadoop but adds proprietary tools, while Hortonworks uses only open source software. MapR takes a different approach, replacing HDFS with its own file system. The document advises trying the distributions' community editions to compare them and determine the features needed before selecting one.
The document summarizes several popular options for SQL on Hadoop including Hive, SparkSQL, Drill, HAWQ, Phoenix, Trafodion, and Splice Machine. Each option is reviewed in terms of key features, architecture, usage patterns, and strengths/limitations. While all aim to enable SQL querying of Hadoop data, they differ in support for transactions, latency, data types, and whether they are native to Hadoop or require separate processes. Hive and SparkSQL are best for batch jobs while Drill, HAWQ and Splice Machine provide lower latency but with different integration models and capabilities.
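To make the comparison concrete, here is a minimal sketch of the batch-style access pattern that Hive serves (and that most of the engines above expose through their own drivers), using the PyHive client; the host, port, username, and the web_logs table are illustrative assumptions, not from the deck.

```python
# Minimal sketch: querying a Hive table over HiveServer2 with PyHive.
# Host, port, username, and the `web_logs` table are assumptions.
from pyhive import hive

conn = hive.connect(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# An aggregate query typical of batch-oriented SQL-on-Hadoop workloads.
cursor.execute(
    "SELECT status_code, COUNT(*) AS hits "
    "FROM web_logs GROUP BY status_code"
)
for status_code, hits in cursor.fetchall():
    print(status_code, hits)

cursor.close()
conn.close()
```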
The document provides an overview of Hadoop, including:
- What Hadoop is and its core modules like HDFS, YARN, and MapReduce.
- Reasons for using Hadoop like its ability to process large datasets faster across clusters and provide predictive analytics.
- When Hadoop should and should not be used, for example large, diverse datasets (a good fit) versus low-latency real-time analytics (a poor fit).
- Options for deploying Hadoop including as a service on cloud platforms, on infrastructure as a service providers, or on-premise with different distributions.
- Components that make up the Hadoop ecosystem like Pig, Hive, HBase, and Mahout.
Building a Big Data platform with the Hadoop ecosystem - Gregg Barrett
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
The strategic relationship between Hortonworks and SAP enables SAP to resell Hortonworks Data Platform (HDP) and provide enterprise support for their global customer base. This means SAP customers can incorporate enterprise Hadoop as a complement within a data architecture that includes SAP HANA and SAP BusinessObjects enabling a broad range of new analytic applications.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK - huguk
This session will give you an update on what SUSE is up to in the Big Data arena. We will take a brief look at SUSE Linux Enterprise Server and why it makes the perfect foundation for your Hadoop deployment.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc... - Agile Testing Alliance
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing by "Sampat Kumar" from "Harman". The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved with the author.
The document discusses when to use Hadoop instead of a relational database management system (RDBMS) for advanced analytics. It provides examples of when queries like count distinct, cursors, and alter table statements become problematic in an RDBMS. It contrasts analyzing simple, transactional data like invoices versus complex, evolving data like customers or website visitors. Hadoop is better suited for problems involving complex objects, self-joins on large datasets, and matching large datasets. The document encourages structuring data in HDFS in a flexible way that fits the problem and use cases like simple counts on complex objects, self-self-self joins, and matching problems.
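To make the contrast concrete, here is a sketch (not from the talk) of the two query shapes it calls out, expressed in PySpark; the dataset path and column names are illustrative assumptions.

```python
# Sketch of the query shapes the talk calls out, in PySpark.
# The visits path and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdbms-pain-points").getOrCreate()
visits = spark.read.parquet("hdfs:///data/visits.parquet")

# Count distinct over a huge table: easy to distribute across a cluster,
# painful on a single database server.
distinct_visitors = visits.agg(F.countDistinct("visitor_id")).first()[0]

# A self-join on a large dataset: visitors who returned within 7 days.
a, b = visits.alias("a"), visits.alias("b")
returns = a.join(
    b,
    (F.col("a.visitor_id") == F.col("b.visitor_id"))
    & (F.datediff(F.col("b.ts"), F.col("a.ts")).between(1, 7)),
)
print(distinct_visitors, returns.count())
```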
SQL on Hadoop
Looking for the correct tool for your SQL-on-Hadoop use case?
There is a long list of alternatives to choose from; how do you select the correct tool?
The tool selection is always based on use case requirements.
Read more on alternatives and our recommendations.
This document provides an overview and comparison of RDBMS, Hadoop, and Spark. It introduces RDBMS and describes its use cases such as online transaction processing and data warehouses. It then introduces Hadoop and describes its ecosystem including HDFS, YARN, MapReduce, and related sub-modules. Common use cases for Hadoop are also outlined. Spark is then introduced along with its modules like Spark Core, SQL, and MLlib. Use cases for Spark include data enrichment, trigger event detection, and machine learning. The document concludes by comparing RDBMS and Hadoop, as well as Hadoop and Spark, and addressing common misconceptions about Hadoop and Spark.
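As a sketch of two of the Spark use cases named above, data enrichment and trigger event detection; the file paths and column names are assumptions for illustration only.

```python
# Sketch of two Spark use cases named above: data enrichment with a
# reference join, and trigger-event detection. Paths/columns assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-use-cases").getOrCreate()

# Data enrichment: join raw events against a reference table.
events = spark.read.json("hdfs:///data/events.json")
countries = spark.read.csv("hdfs:///ref/countries.csv", header=True)
enriched = events.join(countries, on="country_code", how="left")

# Trigger event detection: flag unusually large transactions.
triggers = enriched.filter(F.col("amount") > 10_000)
triggers.write.mode("overwrite").parquet("hdfs:///out/triggers")
```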
This document provides an overview of a SQL-on-Hadoop tutorial. It introduces the presenters and discusses why SQL is important for Hadoop, as MapReduce is not optimal for all use cases. It also notes that while the database community knows how to efficiently process data, SQL-on-Hadoop systems face challenges due to the limitations of running on top of HDFS and Hadoop ecosystems. The tutorial outline covers SQL-on-Hadoop technologies like storage formats, runtime engines, and query optimization.
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last! - Nicolas Morales
This document provides an overview of IBM's Big SQL product for running SQL queries on Hadoop data. It discusses how Big SQL uses a massively parallel processing (MPP) architecture to replace MapReduce for improved performance. Big SQL nodes run directly on the Hadoop cluster to process data locally. The document highlights Big SQL's full SQL query capabilities and support for analytic functions. It also notes how Big SQL leverages the existing Hive metadata and is designed to integrate with the broader Hadoop ecosystem.
This document summarizes Andrew Brust's presentation on using the Microsoft platform for big data. It discusses Hadoop and HDInsight, MapReduce, using Hive with ODBC and the BI stack. It also covers Hekaton, NoSQL, SQL Server Parallel Data Warehouse, and PolyBase. The presentation includes demos of HDInsight, MapReduce, and using Hive with the BI stack.
The strategic relationship between Hortonworks and SAP enables SAP to resell Hortonworks Data Platform (HDP) and provide enterprise support for their global customer base. This means SAP customers can incorporate enterprise Hadoop as a complement within a data architecture that includes SAP HANA, Sybase and SAP BusinessObjects enabling a broad range of new analytic applications.
Best Practices for Deploying Hadoop (BigInsights) in the Cloud - Leons Petražickis
This document provides best practices for optimizing the performance of InfoSphere BigInsights and InfoSphere Streams when deployed in the cloud. It discusses optimizing disk performance by choosing cloud providers and instances with good disk I/O, partitioning and formatting disks correctly, and configuring HDFS to use multiple data directories. It also discusses optimizing Java performance by correctly configuring JVM memory and optimizing MapReduce performance by setting appropriate values for map and reduce tasks based on machine resources.
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
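The word-count job mentioned above can be sketched in a few lines of Python; this local simulation mirrors the map, shuffle/sort, and reduce stages that Hadoop distributes (with Hadoop Streaming, the two functions would become separate mapper and reducer scripts reading stdin).

```python
# Minimal word-count sketch in the MapReduce style the deck describes.
# Runs locally; Hadoop would distribute the same stages across a cluster.
from itertools import groupby

def mapper(lines):
    # Map stage: emit (word, 1) for every word.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop sorts map output by key before the reduce stage.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["big data on hadoop", "hadoop stores big data"]
    for word, total in reducer(mapper(text)):
        print(word, total)
```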
This document discusses modern data architecture and Apache Hadoop's role within it. It presents WANdisco and its Non-Stop Hadoop solution, which extends HDFS across multiple data centers to provide 100% uptime for Hadoop deployments. Non-Stop Hadoop uses WANdisco's patented distributed coordination engine to synchronize HDFS metadata across sites separated by wide area networks, enabling continuous availability of HDFS data and global HDFS deployments.
Overview of Big data, Hadoop and Microsoft BI - version1 - Thanh Nguyen
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison... - Cloudera, Inc.
The document discusses integrating Hadoop with relational databases. It describes scenarios where reference data is stored in an RDBMS and used in Hadoop, Hadoop is used for offline analytics on data stored in an RDBMS, and MapReduce outputs are exported to an RDBMS. It then presents a case study on extending Sqoop for optimized Oracle integration and compares performance with and without the extension. Other tools for Hadoop-RDBMS integration are also briefly outlined.
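For flavor, a sketch of the kind of parallel Sqoop import the case study optimizes, launched from Python; the JDBC URL, credentials, table, and target directory are illustrative assumptions.

```python
# Sketch of a Sqoop import like those discussed above, launched from
# Python. The JDBC URL, user, table, and directory are assumptions.
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@db.example.com:1521:ORCL",
        "--username", "etl_user",
        "--table", "ORDERS",
        "--target-dir", "/data/orders",
        "--num-mappers", "4",  # parallel map tasks doing the extract
    ],
    check=True,
)
```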
The document provides an overview of an experimentation platform built on Hadoop. It discusses experimentation workflows, why Hadoop was chosen as the framework, the system architecture, and challenges faced and lessons learned. Key points include:
- The platform supports A/B testing and reporting on hundreds of metrics and dimensions for experiments.
- Data is ingested from various sources and stored in Hadoop for analysis using technologies like Hive, Spark, and Scoobi.
- Challenges included optimizing joins and jobs for large datasets, addressing data skew, and ensuring job resiliency. Tuning configuration parameters and job scheduling helped improve performance.
This is part of an introductory course to Big Data Tools for Artificial Intelligence. These slides introduce students to Apache Hadoop, DFS, and MapReduce.
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre... - Data Con LA
The team at Fandango heartily embraced NoSQL, using Couchbase to power a key media publishing system. The initial implementation was fraught with integration issues and high latency, and required a major effort to successfully refactor. My talk will outline the key organizational and architectural decisions that created deep systemic problems, and the steps taken to re-architect the system to achieve a high level of performance at scale.
Yarn cloudera-kathleenting061414 kate-ting - Data Con LA
This document summarizes Kathleen Ting's presentation on migrating to MapReduce v2 (MRv2) on YARN. The presentation covered the motivation for moving to MRv2 and YARN, including higher cluster utilization and lower costs. It then discussed common misconfiguration issues seen in support tickets, such as memory, thread pool size, and federation misconfigurations. Specific examples were provided for resolving task memory errors, JobTracker memory errors, and fetch failures in both MRv1 and MRv2. Recommendations were given for optimizing YARN memory usage and CPU isolation in containers.
Aziksa hadoop for business users2 santosh jha - Data Con LA
This document discusses big data, including its drivers, characteristics, use cases across different industries, and lessons learned. It provides examples of companies like Etsy, Macy's, Canadian Pacific, and Salesforce that are using big data to gain insights, increase revenues, reduce costs and improve customer experiences. Big data is being used across industries like financial services, healthcare, manufacturing, and media/entertainment for applications such as customer profiling, fraud detection, operations optimization, and dynamic pricing. While big data projects show strong financial benefits, the document cautions that not all projects are well-structured and Hadoop alone is not sufficient to meet all business analysis needs.
The document discusses various options for processing and aggregating data in MongoDB, including the Aggregation Framework, MapReduce, and connecting MongoDB to external systems like Hadoop. The Aggregation Framework is described as a flexible way to query and transform data in MongoDB using a JSON-like syntax and pipeline stages. MapReduce is presented as more versatile but also more complex to implement. Connecting to external systems like Hadoop allows processing large amounts of data across clusters.
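A minimal sketch of the Aggregation Framework pipeline style described above, using PyMongo; the database, collection, and field names are illustrative assumptions.

```python
# Sketch of a MongoDB Aggregation Framework pipeline via PyMongo.
# Database, collection, and fields are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders

# Pipeline stages: filter, then group-and-sum, then sort.
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["total"])
```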
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio... - Data Con LA
Apache Solr makes it so easy to interactively visualize and explore your data. Create a dashboard, add some facets, select some values, cross it with the time… and just look at the results.
Apache Spark is the growing framework for performing streaming computations, which makes it ideal for real time indexing.
Solr also comes with new Analytics Facets which are a major weapon added to the arsenal of the data explorer. They bring another dimension: calculations. We can now do the equivalent of SQL, just in a much simpler and faster way. These calculations can operate over buckets of data. For example, it is now possible to see the sum of Web traffic by country over the time, the median price of some categories of products, which ads are bringing more money by location...
This talk puts into practice some of the leading features of Solr Search. It presents the main types of facets/stats and which advanced properties and usage make them shine. A demo in parallel with the open source Search App in Hue will demonstrate how these facets can power interactive widgets or your own analytic queries. The data will be indexed in real time from a live stream with Spark.
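As a rough illustration of the analytics-facet style the talk demos, here is a sketch that posts a terms facet with a per-bucket sum to Solr's JSON Request API using the requests library; the weblogs collection and field names are assumptions.

```python
# Sketch of a Solr JSON Facet request like the ones demoed, sent with
# the requests library. Collection name and fields are assumptions.
import requests

resp = requests.post(
    "http://localhost:8983/solr/weblogs/select",
    json={
        "query": "*:*",
        "limit": 0,  # we only want the facet buckets, not documents
        "facet": {
            "by_country": {
                "type": "terms",
                "field": "country",
                "facet": {"traffic": "sum(bytes)"},  # stat per bucket
            }
        },
    },
)
print(resp.json()["facets"]["by_country"]["buckets"])
```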
Kiji cassandra la june 2014 - v02 clint-kelly - Data Con LA
Big Data Camp LA 2014, Don't re-invent the Big-Data Wheel, Building real-time, Big Data applications on Cassandra with the open-source Kiji project by Clint Kelly of Wibidata
20140614 introduction to spark-ben white - Data Con LA
This document provides an introduction to Apache Spark. It begins by explaining how Spark improves upon MapReduce by leveraging distributed memory for better performance and supporting iterative algorithms. Spark is described as a general purpose computational framework that retains the advantages of MapReduce like scalability and fault tolerance, while offering more functionality through directed acyclic graphs and libraries for machine learning. The document then discusses getting started with Spark and its execution modes like standalone, YARN client, and YARN cluster. Finally, it introduces Spark concepts like Resilient Distributed Datasets (RDDs), which are collections of objects partitioned across a cluster.
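A minimal sketch of the RDD idea the slides introduce: a partitioned collection, transformed lazily and cached in memory so repeated actions avoid recomputation (the data here is synthetic).

```python
# Sketch of the RDD concept: a partitioned collection transformed
# lazily and cached in memory for iterative reuse.
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-intro")

numbers = sc.parallelize(range(1_000_000), numSlices=8)
squares = numbers.map(lambda x: x * x).cache()  # kept in memory

# Two actions reuse the cached RDD instead of recomputing it.
print(squares.sum())
print(squares.take(3))
sc.stop()
```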
This document discusses how big data analytics can provide insights from large amounts of structured and unstructured data. It provides examples of how big data has helped organizations reduce customer churn, improve customer acquisition, speed up loan approvals, and detect fraud. The document also outlines IBM's big data platform and analytics process for extracting value from large, diverse data sources.
Ag big datacampla-06-14-2014-ajay_gopal - Data Con LA
This document provides an overview of CARD.COM, a company that offers prepaid debit cards customized with different designs. They collect data from card transactions, member interactions on their site/app, and marketing platforms to test different designs and better understand customer behavior. Their goal is to use data science to personalize the financial experience for members and potentially offer services like credit scores for the unbanked. They are hiring various technical roles and use open source tools like R, Python, and PHP to build out their analytics platforms and infrastructure.
The document summarizes the Hadoop stack and its components for storing and analyzing big data. It includes file storage with HDFS, data processing with MapReduce, data access tools like Hive and Pig, and security/monitoring with Kerberos and Nagios. HDFS uses metadata to track file locations across data nodes in a fault-tolerant manner similar to a file system.
140614 bigdatacamp-la-keynote-jon hsieh - Data Con LA
The document discusses the evolution of big data stacks from their origins inspired by Google's systems through imitation via Hadoop-based stacks to ongoing innovation. It traces the development of major components like MapReduce, HDFS, HBase and their adoption beyond Google. It also outlines the timeline of open source projects and companies in this space from 2003 to the present.
Hadoop and NoSQL joining forces by Dale Kim of MapR - Data Con LA
More and more organizations are turning to Hadoop and NoSQL to manage big data. In fact, many IT professionals consider each of those terms to be synonymous with big data. At the same time, these two technologies are seen as different beasts that handle different challenges. That means they are often deployed in a rather disjointed way, even when intended to solve the same overarching business problem. The emerging trend of “in-Hadoop databases” promises to narrow the deployment gap between them and enable new enterprise applications. In this talk, Dale will describe that integrated architecture and how customers have deployed it to benefit both the technical and the business teams.
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ... - Data Con LA
1. The document discusses lessons learned from designing data ingest systems. Key lessons include structuring endpoints wisely, accepting at-least-once semantics (sketched after this list), knowing that change data capture is difficult, understanding service level agreements, considering record format and schema, and tracking record lineage.
2. The document also provides examples of real-world data ingest scenarios and different implementation strategies with analyses of their tradeoffs. It concludes with recommendations to track errors and keep transformations minimal.
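One of those lessons, accepting at-least-once semantics, usually comes down to making the sink idempotent; a toy sketch follows, with the record shape and the in-memory store as stand-in assumptions for a durable key-value store.

```python
# Sketch of one lesson above: tolerate at-least-once delivery by making
# the sink idempotent. Record shape and the store are assumptions.
seen_ids = set()          # stand-in for a durable key-value store
store = {}

def ingest(record):
    # Dedupe on a stable record id so redelivery is harmless.
    if record["id"] in seen_ids:
        return
    seen_ids.add(record["id"])
    # Keep transformations minimal; tag lineage instead of rewriting.
    record["lineage"] = {"source": record.get("source", "unknown")}
    store[record["id"]] = record

for r in [{"id": 1, "v": "a"}, {"id": 1, "v": "a"}, {"id": 2, "v": "b"}]:
    ingest(r)
print(len(store))  # 2: the duplicate delivery was absorbed
```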
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ... - Data Con LA
NoSQL has exploded on the developer scene promising alternatives to RDBMS that make rapidly developing, Internet scale applications easier than ever. However, as a trade off to the ease of development and scale, some of the familiarity with other well-known query interfaces such as SQL has been lost. Until now that is... N1QL (pronounced "nickel") is a SQL-like query language for querying JSON, which brings the familiarity of RDBMS back to the NoSQL world. In this session you will learn about the syntax and basics of this new language as well as integration with the Couchbase SDKs.
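For flavor, a sketch of issuing a N1QL query from Python, written against the 2.x-era Couchbase SDK; the cluster address, credentials, and the travel-sample bucket are assumptions, and newer SDK versions expose a different API.

```python
# Sketch of a N1QL query via the 2.x-era Couchbase Python SDK.
# Cluster address, credentials, and bucket are assumptions.
from couchbase.cluster import Cluster, PasswordAuthenticator
from couchbase.n1ql import N1QLQuery

cluster = Cluster("couchbase://localhost")
cluster.authenticate(PasswordAuthenticator("user", "password"))
bucket = cluster.open_bucket("travel-sample")

# SQL-like syntax over JSON documents, with a named parameter.
query = N1QLQuery(
    "SELECT name FROM `travel-sample` WHERE type = $doctype LIMIT 5",
    doctype="airline",
)
for row in bucket.n1ql_query(query):
    print(row["name"])
```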
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S... - Data Con LA
Investigated a couple of audio-based, deep learning strategies for identifying human vocalized car sounds. In one case Mel Frequency Cepstral Coefficients (MFCCs) were used as inputs into a supervised, logistic regression neural network. In a separate case, Short Term Fourier Transforms (STFT) were used to generate PCA-whitened spectrograms, which were used as inputs into a supervised, convolutional neural network. The MFCC method trained quickly on a relatively small dataset of 4 sounds. The STFT method resulted in a much larger input matrix, resulting in much longer times for converging onto a solution.
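The MFCC front end described here can be sketched with librosa; the audio file name, sample rate, and coefficient count are illustrative assumptions.

```python
# Sketch of the MFCC feature extraction step described above, using
# librosa. The wav path, sample rate, and n_mfcc are assumptions.
import librosa
import numpy as np

y, sr = librosa.load("car_horn_vocalization.wav", sr=22050)

# 13 Mel Frequency Cepstral Coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Average over time to get one fixed-length vector per clip,
# suitable as input to a small supervised classifier.
clip_features = np.mean(mfcc, axis=1)
print(clip_features.shape)  # (13,)
```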
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite... - Data Con LA
This document discusses decision making systems and the lambda architecture. It introduces decision making algorithms like multi-armed bandits that balance exploration vs exploitation. Contextual multi-armed bandits are discussed as well. The lambda architecture is then described as having serving, speed, and batch layers to enable low latency queries, real-time updates, and batch model training. The software stack of Kafka, Spark/Spark Streaming, HBase and MLLib is presented as enabling scalable stream processing and machine learning.
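The exploration-versus-exploitation balance described above is easy to see in an epsilon-greedy multi-armed bandit; a self-contained sketch with made-up reward rates follows.

```python
# Sketch of the explore/exploit trade-off: an epsilon-greedy bandit.
# The reward probabilities are made-up assumptions.
import random

true_rates = [0.02, 0.05, 0.04]        # unknown to the algorithm
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]               # running mean reward per arm
epsilon = 0.1

for _ in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_rates))               # explore
    else:
        arm = max(range(len(values)), key=values.__getitem__)  # exploit
    reward = 1.0 if random.random() < true_rates[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print(values)  # estimates converge toward the true rates
```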
The document provides information about a training on big data and Hadoop. It covers topics like HDFS, MapReduce, Hive, Pig and Oozie. The training is aimed at CEOs, managers, developers and helps attendees get Hadoop certified. It discusses prerequisites for learning Hadoop, how Hadoop addresses big data problems, and how companies are using Hadoop. It also provides details about the curriculum, profiles of trainers and job roles working with Hadoop.
Hadoop essentials by shiva achari - sample chapter - Shiva Achari
Sample chapter of Hadoop Ecosystem
Delve into the key concepts of Hadoop and get a thorough understanding of the Hadoop ecosystem
For more information: http://bit.ly/1AeruBR
At APTRON Delhi, we believe in hands-on learning. That's why our Hadoop training in Delhi is designed to give you practical experience working with Hadoop. You'll work on real-world projects and learn from experienced instructors who have worked with Hadoop in the industry.
https://bit.ly/3NnvsHH
Big Data is still a challenge for many companies to collect, process, and analyze large amounts of structured and unstructured data. Hadoop provides an open source framework for distributed storage and processing of large datasets across commodity servers to help companies gain insights from big data. While Hadoop is commonly used, Spark is becoming a more popular tool that can run 100 times faster for iterative jobs and integrates with SQL, machine learning, and streaming technologies. Both Hadoop and Spark often rely on the Hadoop Distributed File System for storage and are commonly implemented together in big data projects and platforms from major vendors.
Presented By :- Rahul Sharma
B-Tech (Cloud Technology & Information Security)
2nd Year 4th Sem.
Poornima University (I.Nurture),Jaipur
www.facebook.com/rahulsharmarh18
This slide deck gives simple and purposeful knowledge about popular Hadoop platforms.
From a simple definition to the importance of Hadoop in the modern era, the presentation also introduces Hadoop service providers along with its core components.
Do go through it once and comment below with your feedback. I am sure that this deck will help many in presenting the basics of Hadoop for their projects or business purposes.
The crisp information has been compiled from detailed material available on the internet as well as research papers.
The course additionally covers configuring, deploying, and maintaining a Hadoop cluster. The Hadoop Admin training focuses on practical hands-on exercises and encourages open discussion of how people are using Hadoop in enterprises managing massive data sets.
1. Hadoop adoption has matured from initial small deployments to scaling up across enterprises, but configuring and managing large Hadoop environments can be difficult and expensive.
2. Hadoop as a Service (HaaS) provides an alternative where enterprises can deploy Hadoop in the cloud to avoid the challenges of managing large on-premise clusters.
3. HaaS allows enterprises to focus on data analysis rather than infrastructure while reducing costs and providing scalability, high availability, and self-configuration capabilities not easily achieved on-premise.
This document discusses how Apache Hadoop provides a solution for enterprises facing challenges from the massive growth of data. It describes how Hadoop can integrate with existing enterprise data systems like data warehouses to form a modern data architecture. Specifically, Hadoop provides lower costs for data storage, optimization of data warehouse workloads by offloading ETL tasks, and new opportunities for analytics through schema-on-read and multi-use data processing. The document outlines the core capabilities of Hadoop and how it has expanded to meet enterprise requirements for data management, access, governance, integration and security.
Architecting the Future of Big Data and Search - Hortonworks
The document discusses the potential for integrating Apache Lucene and Apache Hadoop technologies. It covers their histories and current uses, as well as opportunities and challenges around making them work better together through tighter integration or code sharing. Developers and businesses are interested in ways to improve searching large amounts of data stored using Hadoop technologies.
Supporting Financial Services with a More Flexible Approach to Big Data - Hortonworks
The document discusses how Hortonworks Data Platform (HDP) enables a modern data architecture with Apache Hadoop. HDP provides a common data set stored in HDFS that can be accessed through various applications for batch, interactive, and real-time processing. This allows organizations to store all their data in one place and access it simultaneously through multiple means. YARN is the architectural center of HDP and enables this modern data architecture. HDP also provides enterprise capabilities like security, governance, and operations to make Hadoop suitable for business use.
Hybrid Data Warehouse Hadoop Implementations - David Portnoy
The document discusses the evolving relationship between data warehouse (DW) and Hadoop implementations. It notes that DW vendors are incorporating Hadoop capabilities while the Hadoop ecosystem is growing to include more DW-like functions. Major DW vendors will likely continue playing a key role by acquiring successful new entrants or incorporating their technologies. The optimal approach involves a hybrid model that leverages the strengths of DWs and Hadoop, with queries determining where data resides and processing occurs. SQL-on-Hadoop architectures aim to bridge the two worlds by bringing SQL and DW tools to Hadoop.
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes key Hadoop components like HDFS for distributed file storage and MapReduce for distributed processing. Several companies that use Hadoop at large scale are mentioned, including Yahoo, Amazon and Facebook. Applications of Hadoop in healthcare for storing and analyzing large amounts of medical data are discussed. The document concludes that Hadoop is well-suited for big data applications due to its scalability, fault tolerance and cost effectiveness.
E2Matrix Jalandhar provides Big Data training based on current industry standards, helping attendees secure placements in their dream jobs at MNCs. E2Matrix provides Big Data training in Jalandhar, Amritsar, Ludhiana, Phagwara, Mohali, and Chandigarh, and is one of the best Big Data training institutes offering hands-on practical knowledge. At E2Matrix, Big Data training is conducted by subject-specialist corporate professionals with extensive experience managing real-time Big Data projects. E2Matrix implements a blend of academic learning and practical sessions to give students optimum exposure. At E2Matrix's well-equipped Big Data training institute, aspirants learn skills covering a Big Data overview, use cases, the data analytics process, data preparation and its tools, hands-on exercises using SQL and NoSQL databases, hands-on exercises on tool usage, an introduction to data analysis, classification, data visualization using R, and automation testing training on real-time projects.
Hortonworks - What's Possible with a Modern Data Architecture? - Hortonworks
This is Mark Ledbetter's presentation from the September 22, 2014 Hortonworks webinar “What’s Possible with a Modern Data Architecture?” Mark is vice president for industry solutions at Hortonworks. He has more than twenty-five years' experience in the software industry with a focus on retail and supply chain.
This document provides an overview of big data and Hadoop. It discusses what big data is, its types including structured, semi-structured and unstructured data. Some key sources of big data are also outlined. Hadoop is presented as a solution for managing big data through its core components like HDFS for storage and MapReduce for processing. The Hadoop ecosystem including other related tools like Hive, Pig, Spark and YARN is also summarized. Career opportunities in working with big data are listed at the end.
Transform Your Business with Big Data and Hortonworks - Hortonworks
This document summarizes a presentation about Hortonworks and how it can help companies transform their businesses with big data and Hortonworks' Hadoop distribution. Hortonworks is the sole distributor of an open source, enterprise-grade Hadoop distribution called Hortonworks Data Platform (HDP). HDP addresses enterprise requirements for mixed workloads, high availability, security and more. The presentation discusses how Hortonworks enables interoperability and supports customers. It also provides an overview of how Pactera can help clients with big data implementation, architecture, and analytics.
BIG Data & Hadoop Applications in Social Media - Skillspeed
This document discusses how major social media networks like Facebook, Twitter, LinkedIn, Pinterest, and Instagram utilize big data and Hadoop technologies. It provides examples of how each network uses Hadoop for tasks like storing user data, performing analytics, and generating personalized recommendations at massive scales as their user bases and data volumes grow enormously. The document also briefly outlines SkillSpeed's Hadoop training course, which covers topics like HDFS, MapReduce, Pig, Hive, HBase and more to prepare students for jobs working with big data.
1. LAUSD has been developing its enterprise data and reporting capabilities since 2000, with various systems and dashboards launched over the years to provide different types of data and reporting, including student outcomes and achievement reports, individual student records, and teacher/staff data.
2. Current tools include MyData (with over 20 million student records), GetData (with instructional and business data), Whole Child (with academic and wellness data), OpenData, and Executive Dashboards.
3. Upcoming improvements include dashboards for social-emotional learning, physical education, and tools to support the Intensive Diagnostic Education Centers and Black Student Achievement Plan initiatives.
The document discusses the County of Los Angeles' efforts to better coordinate services across various departments by creating an enterprise data platform. It notes that the county serves over 750,000 patients annually through its health systems and oversees many other services related to homelessness, justice, child welfare, and public health. The proposed data platform would create a unified client identifier and data store to integrate client records across departments in order to generate insights, measure outcomes, and improve coordination of services.
Fastly is an edge cloud platform provider that aims to upgrade the internet experience by making applications and digital experiences fast, engaging, and secure. It has a global network of 100+ points of presence across 30+ countries serving over 1 trillion daily requests. The presentation discusses how internet requests are handled traditionally versus more modern approaches using an edge cloud platform like Fastly. It emphasizes that the edge must be programmable, deliver general purpose compute anywhere, and provide high reliability, security, and data privacy by default.
The document summarizes how Aware Health can save self-insured employers millions of dollars by reducing unnecessary surgeries, imaging, and lost work time for musculoskeletal conditions. It notes that 95% of common spine, wrist, and other surgeries are no more effective than non-surgical treatments. Aware Health uses diagnosis without imaging to prevent chronic pain and has shown real-world savings of $9.78 to $78.66 per member per month for employers, a 96% net promoter score, and over $2 million in annual savings for one enterprise customer.
- Project Lightspeed is the next generation of Apache Spark Structured Streaming that aims to provide faster and simpler stream processing with predictable low latency.
- It targets reducing tail latency by up to 2x through faster bookkeeping and offset management. It also enhances functionality with advanced capabilities like new operators and easy to use APIs.
- Project Lightspeed also aims to simplify deployment, operations, monitoring and troubleshooting of streaming applications. It seeks to improve ecosystem support for connectors, authentication and authorization.
- Some specific improvements include faster micro-batch processing, enhancing Python as a first class citizen, and making debugging of streaming jobs easier through visualizations.
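For context, here is a minimal Structured Streaming micro-batch job of the kind Project Lightspeed targets, using Spark's built-in rate source; the sink choice and run duration are illustrative assumptions.

```python
# Sketch of a Structured Streaming micro-batch job, using the built-in
# 'rate' test source. Sink and run duration are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lightspeed-sketch").getOrCreate()

# The 'rate' source generates (timestamp, value) rows for testing.
stream = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

counts = (
    stream.withWatermark("timestamp", "10 seconds")
          .groupBy(F.window("timestamp", "5 seconds"))
          .count()
)

query = (
    counts.writeStream.outputMode("update")
          .format("console")       # swap for a real sink in production
          .start()
)
query.awaitTermination(30)  # run for ~30 seconds, then fall through
spark.stop()
```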
Data Con LA 2022 - Using Google trends data to build product recommendations - Data Con LA
Mike Limcaco, Analytics Specialist / Customer Engineer at Google
Measure trends in a particular topic or search term on Google Search across the US, down to the city level. Integrate these data signals into analytic pipelines to drive product, retail, and media (video, audio, digital content) recommendations tailored to your audience segment. We'll discuss how Google's unique datasets can be used with Google Cloud smart analytic services to process, enrich and surface the most relevant product or content that matches the ever-changing interests of your local customer segment.
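A sketch of how such trend signals might be pulled into a pipeline with the BigQuery Python client; the public google_trends.top_terms table and its column names are assumptions based on the bigquery-public-data registry, so verify the schema before relying on this.

```python
# Sketch of pulling trend signals with the BigQuery client. The public
# table name and its columns are assumptions; verify before use.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT term, week, rank
    FROM `bigquery-public-data.google_trends.top_terms`
    WHERE dma_name LIKE '%Los Angeles%'
    ORDER BY week DESC, rank ASC
    LIMIT 25
"""
for row in client.query(sql).result():
    print(row.term, row.week, row.rank)
```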
Melinda Thielbar, Data Science Practice Lead and Director of Data Science at Fidelity Investments
From corporations to governments to private individuals, most of the AI community has recognized the growing need to incorporate ethics into the development and maintenance of AI models. Much of the current discussion, though, is meant for leaders and managers. This talk is directed to data scientists, data engineers, ML Ops specialists, and anyone else who is responsible for the hands-on, day-to-day of work building, productionalizing, and maintaining AI models. We'll give a short overview of the business case for why technical AI expertise is critical to developing an AI Ethics strategy. Then we'll discuss the technical problems that cause AI models to behave unethically, how to detect problems at all phases of model development, and the tools and techniques that are available to support technical teams in Ethical AI development.
Data Con LA 2022 - Improving disaster response with machine learning - Data Con LA
Antje Barth, Principal Developer Advocate, AI/ML at AWS & Chris Fregly, Principal Engineer, AI & ML at AWS
The frequency and severity of natural disasters are increasing. In response, governments, businesses, nonprofits, and international organizations are placing more emphasis on disaster preparedness and response. Many organizations are accelerating their efforts to make their data publicly available for others to use. Repositories such as the Registry of Open Data on AWS and Humanitarian Data Exchange contain troves of data available for use by developers, data scientists, and machine learning practitioners. In this session, see how a community of developers came together though the AWS Disaster Response hackathon to build models to support natural disaster preparedness and response.
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas - Data Con LA
Sig Narvaez, Executive Solution Architect at MongoDB
MongoDB is now a Developer Data Platform. Come learn what's new in the 6.0 release and Atlas following all the recent announcements made at MongoDB World 2022. Topics will include:
- Atlas Search which combines 3 systems into one (database, search engine, and sync mechanisms) letting you focus on your product's differentiation.
- Atlas Data Federation to seamlessly query, transform, and aggregate data from one or more MongoDB Atlas databases, Atlas Data Lake and AWS S3 buckets
- Queryable Encryption lets you run expressive queries on fully randomized encrypted data to meet the most stringent security requirements
- Relational Migrator which analyzes your existing relational schemas and helps you design a new MongoDB schema.
- And more!
Data Con LA 2022 - Real world consumer segmentation - Data Con LA
Jaysen Gillespie, Head of Analytics and Data Science at RTB House
1. Shopkick has over 30M downloads, but the userbase is very heterogeneous. Anecdotal evidence indicated a wide variety of users for whom the app holds long-term appeal.
2. Marketing and other teams challenged Analytics to get beyond basic summary statistics and develop a holistic segmentation of the userbase.
3. Shopkick's data science team used SQL and Python to gather and clean data, and then performed a data-driven segmentation using a k-means algorithm (a minimal sketch follows this list).
4. Interpreting the results is more work -- and more fun -- than running the algo itself. We'll discuss how we transform from "segment 1", "segment 2", etc. to something that non-analytics users (Marketing, Operations, etc.) could actually benefit from.
5. So what? How did teams across Shopkick change their approach given what Analytics had discovered?
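A minimal sketch of the segmentation step in point 3, using scikit-learn with synthetic stand-in features; in practice the feature engineering and the interpretation of segments are where the real work lies.

```python
# Minimal sketch of k-means segmentation over engineered user features.
# The features here are synthetic stand-ins; k=5 is an assumption.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for per-user features, e.g. visits/week, kicks earned, recency.
X = rng.normal(size=(10_000, 3))

X_scaled = StandardScaler().fit_transform(X)   # k-means is scale-sensitive
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)

# Segment sizes; naming what each segment *means* is the real work.
print(np.bincount(kmeans.labels_))
```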
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo... - Data Con LA
Ravi Pillala, Chief Data Architect & Distinguished Engineer at Intuit
TurboTax is one of the best-known consumer software brands, serving 385K+ concurrent users at its peak. In this session, we start by looking at how user behavioral data and tax domain events are captured in real time using the event bus and analyzed to drive real-time personalization with various TurboTax data pipelines. We will also look at solutions performing analytics that make use of these events, with the help of Kafka, Apache Flink, Apache Beam, Spark, Amazon S3, Amazon EMR, Redshift, Athena and AWS Lambda functions. Finally, we look at how SageMaker is used to create the TurboTax model to predict if a customer is at risk or needs help.
Data Con LA 2022 - Moving Data at Scale to AWS - Data Con LA
George Mansoor, Chief Information Systems Officer at California State University
Overview of the CSU Data Architecture on moving on-prem ERP data to the AWS Cloud at scale using Delphix for Data Replication/Virtualization and AWS Data Migration Service (DMS) for data extracts
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI - Data Con LA
Anand Ranganathan, Chief AI Officer at Unscrambl
Conversational AI is getting more and more widely used for customer support and employee support use-cases. In this session, I'm going to talk about how it can be extended for data analysis and data science use-cases ... i.e., how users can interact with a bot to ask analytical questions on data in relational databases.
This allows users to explore complex datasets using a combination of text and voice questions, in natural language, and then get back results in a combination of natural language and visualizations. Furthermore, it allows collaborative exploration of data by a group of users in a channel in platforms like Microsoft Teams, Slack or Google Chat.
For example, a group of users in a channel can ask questions to a bot in plain English like "How many cases of Covid were there in the last 2 months by state and gender" or "Why did the number of deaths from Covid increase in May 2022", and jointly look at the results that come back. This facilitates data awareness, data-driven collaboration and joint decision making among teams in enterprises and outside.
In this talk, I'll describe how we can bring together various features including natural-language understanding, NL-to-SQL translation, dialog management, data story-telling, semantic modeling of data and augmented analytics to facilitate collaborate exploration of data using conversational AI.
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ... - Data Con LA
Anil Inamdar, VP & Head of Data Solutions at Instaclustr
The most modernized enterprises utilize polyglot architecture, applying the best-suited database technologies to each of their organization's particular use cases. To successfully implement such an architecture, though, you need a thorough knowledge of the expansive NoSQL data technologies now available.
Attendees of this Data Con LA presentation will come away with:
-- A solid understanding of the decision-making process that should go into vetting NoSQL technologies and how to plan out their data modernization initiatives and migrations.
-- They will learn the types of functionality that best match the strengths of NoSQL key-value stores, graph databases, columnar databases, document-type databases, time-series databases, and more.
-- Attendees will also understand how to navigate database technology licensing concerns, and to recognize the types of vendors they'll encounter across the NoSQL ecosystem. This includes sniffing out open-core vendors that may advertise as "open source" but are driven by a business model that hinges on achieving proprietary lock-in.
-- Attendees will also learn to determine if vendors offer open-code solutions that apply restrictive licensing, or if they support true open source technologies like Hadoop, Cassandra, Kafka, OpenSearch, Redis, Spark, and many more that offer total portability and true freedom of use.
Data Con LA 2022 - Intro to Data Science - Data Con LA
Zia Khan, Computer Systems Analyst and Data Scientist at LearningFuze
This Data Science tutorial is designed for people who are new to Data Science. It is a beginner-level session, so no prior coding or technical knowledge is required; just bring your laptop with WiFi capability. The session starts with a review of what data science is, the amount of data we generate, and how companies are using that data to get insight. We will pick a business use case and define the data science process, followed by a hands-on lab using Python and Jupyter notebook. During the hands-on portion we will work with the pandas, numpy, matplotlib and sklearn modules and use a machine learning algorithm to approach the business use case.
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment - Data Con LA
Mariana Danilovic, Managing Director at Infiom, LLC
We will address:
(1) Community creation and engagement using tokens and NFTs
(2) Organization of DAO structures and ways to incentivize Web3 communities
(3) DeFi business models applied to Web3 ventures
(4) Why Metaverse matters for new entertainment and community engagement models.
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat... - Data Con LA
Curtis ODell, Global Director Data Integrity at Tricentis
Join me to learn about a new end-to-end data testing approach designed for modern data pipelines that fills dangerous gaps left by traditional data management tools—one designed to handle structured and unstructured data from any source. You'll hear how you can use unique automation technology to reach up to 90 percent test coverage rates and deliver trustworthy analytical and operational data at scale. Several real world use cases from major banks/finance, insurance, health analytics, and Snowflake examples will be presented.
Key Learning Objectives
1. Data journeys are complex: you have to ensure the integrity of the data end to end across the journey, from source to final reporting, for compliance.
2. Data management tools do not test data; at best they profile and monitor, leaving serious gaps in your data testing coverage.
3. Automation integrated with DevOps and DataOps CI/CD processes is key to solving this.
4. How this approach has impact in your vertical.
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
1. The document discusses methods for predicting and engineering viral Super Bowl ads, including a panel-based analysis of video content characteristics and a deep learning model measuring social media effects.
2. It provides examples of ads from Super Bowl 2022 that scored well using these methods, such as BMW and Budweiser ads, and compares predicted viral rankings to actual results.
3. The document also demonstrates how to systematically test, tweak, and target an ad campaign like Bajaj Pulsar's to increase virality through modifications to title, thumbnail, tags and content based on audience feedback.
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
Jai Bansal, Senior Manager, Data Science at Aetna
This talk describes an internal data product called Member Embeddings that facilitates modeling of member medical journeys with machine learning.
Medical claims are the key data source we use to understand health journeys at Aetna. Claims are the data artifacts that result from our members' interactions with the healthcare system. Claims contain data like the amount the provider billed, the place of service, and provider specialty. The primary medical information in a claim is represented in codes that indicate the diagnoses, procedures, or drugs for which a member was billed. These codes give us a semi-structured view into the medical reason for each claim and so contain rich information about members' health journeys. However, since the codes themselves are categorical and high-dimensional (10K cardinality), it's challenging to extract insight or predictive power directly from the raw codes on a claim.
To transform claim codes into a more useful format for machine learning, we turned to the concept of embeddings. Word embeddings are widely used in natural language processing to provide numeric vector representations of individual words.
We use a similar approach with our claims data. We treat each claim code as a word or token and use embedding algorithms to learn lower-dimensional vector representations that preserve the original high-dimensional semantic meaning.
This process converts the categorical features into dense numeric representations. In our case, we use sequences of anonymized member claim diagnosis, procedure, and drug codes as training data. We tested a variety of algorithms to learn embeddings for each type of claim code.
We found that the trained embeddings showed relationships between codes that were reasonable from the point of view of subject matter experts. In addition, using the embeddings to predict future healthcare-related events outperformed other basic features, making this tool an easy way to improve predictive model performance and save data scientist time.
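As a rough illustration of the embedding idea described above (the talk does not specify Aetna's library or hyperparameters), one could treat each member's sequence of claim codes as a "sentence" and train word2vec on those sequences, for example with gensim; the codes and settings below are made up:

from gensim.models import Word2Vec

# Each "sentence" is one member's anonymized sequence of claim codes.
member_journeys = [
    ["ICD10:E11", "CPT:83036", "NDC:0378-1105"],
    ["ICD10:I10", "CPT:99213", "NDC:0071-0222", "ICD10:E11"],
]

# Learn low-dimensional dense vectors for the high-cardinality codes.
model = Word2Vec(member_journeys, vector_size=64, window=5, min_count=1, sg=1)
print(model.wv["ICD10:E11"][:5])           # dense vector for a diagnosis code
print(model.wv.most_similar("ICD10:E11"))  # codes that co-occur in journeys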
Data Con LA 2022 - Data Streaming with KafkaData Con LA
Jie Chen, Manager Advisory, KPMG
Data is the new oil. However, many organizations have fragmented data siloed in lines of business. In this talk, we will focus on identifying the legacy patterns and their limitations, and on introducing the new patterns built around Kafka's core design ideas. The goal is to tirelessly pursue better solutions that let organizations overcome bottlenecks in their data pipelines and modernize their digital assets, ready to scale their businesses. In summary, we will walk through three use cases and recommend dos and don'ts, with takeaways for data engineers, data scientists, and data architects developing forefront data-oriented skills.
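A minimal sketch of the publish/subscribe pattern that motivates Kafka's design, using the kafka-python client; the broker address and topic name are assumptions:

from kafka import KafkaProducer, KafkaConsumer

# Producers publish events once, to a topic, instead of into one silo.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", b'{"order_id": 1, "amount": 42.0}')
producer.flush()

# Any number of downstream consumers read the same stream independently.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.offset, message.value)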
1. THE FUTURE OF HADOOP: CHOOSING THE RIGHT OPTIONS
Subash D'Souza
Hadoop Innovation Summit 2014
2. WHO AM I?
Recognized as a Champion of Big Data by Cloudera
Co-Organizer – Los Angeles Hadoop User Group
Organizer – Los Angeles HBase User Group
Organizer – Los Angeles Big Data Users Group
Organizer – Big Data Camp LA
Speaker – Big Data Camp LA 2013
Leading a BOF Session at Hadoop Summit Europe 2014
Author – HBase Developer's Cookbook (Out Fall 2014)
Technical Reviewer – Apache Flume: Distributed Log Collection for Hadoop
3. HADOOP: OLD & NEW
Hadoop was first released in 2006.
It is based on the GFS and MapReduce papers released by Google.
Ever since, adoption has been massive and rapid.
Companies like Facebook, Netflix, eBay, Yahoo, Expedia, Spotify and even the Social Security Administration have adopted Hadoop.
Hadoop 2.0, AKA YARN, went GA in September 2013.
It is backwards compatible with Hadoop 1.0 APIs.
It replaces the JobTracker and TaskTrackers with an Application Master, Resource Manager and Node Managers.
4. A BRIEF HISTORY
2002: Doug Cutting launches the Nutch project
2003: Google releases the GFS paper
2004: Google releases the MapReduce paper; Nutch adds a distributed file system
2005: MapReduce implemented in Nutch
2006: Hadoop spun out of the Nutch project at Yahoo
2008: Cloudera founded; Hadoop breaks the Terasort world record
2009: MapR founded
2010: HBase, Zookeeper, Flume and more added to CDH
2011: Hortonworks founded
2012: Impala (SQL on Hadoop) launched
2013: YARN goes GA; Hadoop 2.0 with HA available
2014: Stinger/Tez to be released
5. PREVIOUSLY, THE STATE OF DATA
As a data analyst, you previously could not ask the questions you wanted to ask, because you did not have the data points available.
Corollary: you could not think of questions to ask of your data, because you did not know you had access to those data points.
7. FOCUS
There is no standard way to get to the data.
This is both a plus and a minus: a plus because there is variety to choose from, a minus because the number of tools to pull the data is huge and ever expanding.
As a company, what do you choose? What do you focus on?
Question: do you replace your current data infrastructure, or do you augment it?
16. CHOICES
Hortonworks: completely open source; everything on their platform is available from the Apache Hadoop distribution. Available as a free download or with paid support.
Cloudera: offers the open source Apache Hadoop distribution as well as management tools built for the Cloudera distribution. Available as a free download, or with paid support including the additional tools.
MapR: offers a version of Hadoop that replaces HDFS with the proprietary MFS (MapR File System); everything else in their stack is based on the open source Apache distribution. Offers a free M3 version along with paid M5 and M7 versions.
17. ADVANTAGES OF YARN
Ability to handle multi-tenant clients, i.e. running multiple applications atop the same framework (multi-tenancy).
Splits the work of the JobTracker into a Resource Manager and an Application Master, so one process no longer has to both allocate resources and manage the tasks.
Ability to restart jobs from the place where they failed.
Scales well beyond the limitations of MR1 (around 4,000 nodes).
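As a small illustration of the Resource Manager's role (my sketch, not from the deck): in Hadoop 2.x the ResourceManager exposes a REST API that can list every application sharing the cluster, which makes the multi-tenancy visible. The hostname below is an assumption; 8088 is the default RM web port.

import requests

RM = "http://resourcemanager-host:8088"

# Query the YARN ResourceManager REST API for all applications.
resp = requests.get(f"{RM}/ws/v1/cluster/apps", timeout=10)
resp.raise_for_status()
apps = resp.json().get("apps") or {}

# Multi-tenancy in practice: MapReduce, Spark, Tez, etc. share one cluster.
for app in apps.get("app", []):
    print(app["id"], app["applicationType"], app["state"], app["user"])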
22. SQL ON HADOOP VS. TRADITIONAL RDBMS
Data on Hadoop is not as responsive as in an RDBMS.
Data in Hadoop can scale much better than in an RDBMS.
Data in Hadoop can be accessed using a variety of mechanisms such as Hive, Impala, Drill, etc.; the query engines are abstracted from the Hadoop (HDFS) storage layer. The same cannot be said of an RDBMS, where the query engine is tied to its own storage and you would need to move data between one system and another; for example, Oracle cannot pull from SQL Server and vice versa.
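A hedged sketch of that abstraction: because the engines share the Hive metastore and read the same HDFS files, the same query can be sent to HiveServer2 or to Impala just by changing the client connection. Hostnames, ports, and the table name are illustrative assumptions.

from pyhive import hive                              # pip install pyhive
from impala.dbapi import connect as impala_connect   # pip install impyla

SQL = "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page"

# Batch-oriented engine: Hive (MapReduce/Tez under the hood).
hive_cur = hive.connect(host="hadoop-edge", port=10000).cursor()
hive_cur.execute(SQL)

# Low-latency MPP engine: Impala, reading the very same HDFS files.
impala_cur = impala_connect(host="hadoop-edge", port=21050).cursor()
impala_cur.execute(SQL)

print(hive_cur.fetchall(), impala_cur.fetchall())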
23. QUESTION?
Do we augment or replace our current data infrastructure?
Answer: augment.
Why? Combine the best of both worlds: keep aggregated data in your existing data stores, and keep all the detail data, over its full lifetime, in Hadoop.
Of course, you will have different SLAs based on the query you ask.
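A minimal sketch of what "augment" can look like in practice (my illustration, with made-up hostnames and tables): low-latency aggregate queries stay on the existing store, while detail/lifetime scans are routed to Hadoop.

import sqlite3             # stand-in for the existing warehouse/RDBMS
from pyhive import hive    # pip install pyhive (HiveServer2 client)

warehouse = sqlite3.connect("warehouse.db")             # aggregated, recent
detail = hive.connect(host="hadoop-edge", port=10000)   # full-detail history

def run_query(sql, needs_full_history=False):
    """Route to Hadoop only when the query needs detail/lifetime data."""
    conn = detail if needs_full_history else warehouse
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

# Fast SLA: pre-aggregated daily revenue from the warehouse.
run_query("SELECT day, SUM(revenue) FROM daily_sales GROUP BY day")
# Relaxed SLA: raw event scan over years of detail data in Hive.
run_query("SELECT user_id, COUNT(*) FROM raw_events GROUP BY user_id",
          needs_full_history=True)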
25. STARTUPS VS. MATURE COMPANIES
Startups working in data should consider going with YARN to gain its advantages.
Mature companies tend to be conservative and hence will look to the more established use cases of MR1.
Both startups and mature companies should look at the advantages of YARN, as well as at applying more near-real-time SQL-on-Hadoop.
26. GETTING STARTED WITH HADOOP VS. ESTABLISHED HADOOP PRACTICES
Getting started with Hadoop: an opportunity to hit the ground running with YARN plus bleeding-edge technologies.
Established companies with a Hadoop practice tend to be conservative, but that shouldn't prevent them from coming up with a migration plan to YARN.
27. REAL TIME ANALYTICS
Kiji
HBase
Storm
Shark
Redshift
Impala
Stinger
Drill
Accumulo
Presto
HAWQ
IBM Big SQL
32. FUTURE OF HADOOP: YARN & NEAR-REAL-TIME SQL-ON-HADOOP
Multi-Tenancy
HA (High Availability)
Tools for SQL-on-Hadoop:
Impala
Stinger/Tez
Drill
Shark
33. WHAT DO YOU CHOOSE?
The choices are huge.
The toolsets are varied.
First, focus on the problems you are trying to solve. Don't choose Hadoop because it is the latest buzzword; make sure there is a real need to solve.
Focus on developers and administrators, and ensure that whatever toolset you choose, they either have the relevant skillset, will be given training, or will be supplemented by relevant resources brought in from outside (whether through hiring or consulting).
REMEMBER THE PROBLEM SET!!! i.e. what you are trying to solve.
34. CAVEATS
Work is still being done on bringing real-time SQL-on-Hadoop to YARN.
Impala has Llama for this.
The Stinger preview for Hive is currently available.
HBase on YARN (HOYA) is also actively being worked on.
Since YARN is a low-level API, some abstraction is needed; this is available in tools such as Samza and Weave.
35. BIG DATA = BIG IMPACT
Ken Rudin, Director of Analytics, Facebook:
"You need to go the last mile and evangelize your insights so that people actually act on them and there is impact."
"It doesn't matter how brilliant our analyses are. If nothing changes, we have made no impact."
36. GIVING BACK
Hadoop is an open source project.
Work on Hadoop and the ecosystem tools is done by committers and contributors, some of whom do this in their own personal time, reporting and fixing bugs as well as adding new functionality.
Please give back, either by becoming a contributor (testing, filing bugs) or by getting your use case for Hadoop out there (at meetups and/or conferences such as this one), so others can learn from the issues you have faced as well as see the rapid adoption of the platform continue.