The document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It describes Hadoop's two core components: the Hadoop Distributed File System (HDFS), which provides scalable data storage, and MapReduce, which processes that data in parallel across the cluster. Problems well suited to Hadoop typically involve complex data from multiple sources that must be consolidated, stored inexpensively at scale, and processed in parallel.
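To make the MapReduce model concrete, below is a minimal sketch of the canonical word-count job written against the standard org.apache.hadoop.mapreduce API. The mapper emits a (word, 1) pair for each token in its input split, and the reducer sums the counts per word; the class names and the input/output paths (taken from the command-line arguments) are illustrative, not from the document itself.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum all counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory (assumed)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (assumed)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a JAR, a job like this would be submitted with something along the lines of `hadoop jar wordcount.jar WordCount <input> <output>`, reading its input from and writing its results to HDFS, with the map tasks scheduled near the data blocks they process.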