With Hadoop 3.0.0-alpha2 released in January 2017, it's time to take a closer look at the features and fixes of Hadoop 3.0.
We will look at Core Hadoop, HDFS and YARN, and address the emerging question: will Hadoop 3.0 be an architectural revolution, as Hadoop 2 was with YARN & Co., or more of an evolution that adapts to new use cases such as IoT, Machine Learning and Deep Learning (TensorFlow)?
Hadoop meets Agile! - An Agile Big Data Model (Uwe Printz)
The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
The document discusses backup and disaster recovery strategies for Hadoop. It focuses on protecting data sets stored in HDFS. HDFS uses data replication and checksums to protect against disk and node failures. Snapshots can protect against data corruption and accidental deletes. The document recommends copying data from the primary to secondary site for disaster recovery rather than teeing, and discusses considerations for large data movement like bandwidth needs and security. It also notes the importance of backing up metadata like Hive configurations along with core data.
This talk takes you on a rollercoaster ride through Hadoop 2 and explains the most significant changes and components.
The talk has been held on the JavaLand conference in Brühl, Germany on 25.03.2014.
Agenda:
- Welcome Office
- YARN Land
- HDFS 2 Land
- YARN App Land
- Enterprise Land
Hadoop Operations - Best practices from the field (Uwe Printz)
Talk about Hadoop Operations and Best Practices for building and maintaining Hadoop cluster.
Talk was held at the data2day conference in Karlsruhe, Germany on 27.11.2014
The document discusses new features in Apache Hadoop 3, including HDFS erasure coding which reduces storage overhead, YARN federation which improves scalability, and the Application Timeline Server which provides improved visibility into application performance. It also covers HDFS multi standby NameNodes which enhances high availability, and the future directions of Hadoop including object storage with Ozone and running HDFS on cloud infrastructure.
Supporting Financial Services with a More Flexible Approach to Big Data (WANdisco Plc)
In this webinar, WANdisco and Hortonworks look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples, and see how a flexible next-generation Hadoop architecture lets you get a step up on improving your business performance. Join us to learn:
- How to leverage data from across an entire global enterprise
- How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
- What industry leaders have put in place
Hadoop and WANdisco: The Future of Big Data (WANdisco Plc)
View the webinar recording here... http://youtu.be/O1pgMMyoJg0
Who: WANdisco CEO, David Richards, and core creators of Apache Hadoop, Dr. Konstantin Shvachko and Jagane Sundare.
What: WANdisco recently acquired AltoStor, a pioneering firm with deep expertise in the multi-billion dollar Big Data market.
New to the WANdisco team are the Hadoop core creators, Dr. Konstantin Shvachko and Jagane Sundare. They will cover the acquisition and reveal how WANdisco's active-active replication technology will change the game of Big Data for the enterprise in 2013.
Hadoop, a proven open source Big Data technology, is the backbone of Yahoo, Facebook, Netflix, Amazon, Ebay and many of the world's largest databases.
When: Tuesday, December 11th at 10am PST (1pm EST).
Why: In this 30-minute webinar you’ll learn:
The staggering, cross-industry growth of Hadoop in the enterprise
How Hadoop's limitations, including HDFS's single point of failure, are impacting the productivity of the enterprise
How WANdisco's active-active replication technology will alleviate these issues by adding high-availability to Hadoop, taking a fundamentally different approach to Big Data
View the webinar Q&A on the WANdisco blog here...http://blogs.wandisco.com/2012/12/14/answers-to-questions-from-the-webinar-of-dec-11-2012/
Bikas Saha: The Next Generation of Hadoop - Hadoop 2 and YARN (hdhappy001)
The document discusses Apache YARN, the next-generation resource management platform for Apache Hadoop. YARN was designed to address limitations of the original Hadoop 1 architecture by supporting multiple data processing models (e.g. batch, interactive, streaming) and improving cluster utilization. YARN achieves this by separating resource management from application execution, allowing various data processing engines like MapReduce, HBase and Storm to run natively on Hadoop. This provides a flexible, efficient and shared platform for distributed applications.
Selective Data Replication with Geographically Distributed Hadoop (DataWorks Summit)
This document discusses selective data replication with geographically distributed Hadoop. It describes running Hadoop across multiple data centers as a single cluster. A coordination engine ensures consistent metadata replication and a global sequence of updates. Data is replicated asynchronously over the WAN for fast ingestion. Selective data replication allows restricting replication of some data to specific locations for regulations, temporary data, or ingest-only use cases. Heterogeneous storage zones with different performance profiles can also be used for selective placement. This architecture aims to provide a single unified file system view, strict consistency, continuous availability, and geographic scalability across data centers.
Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. The core of Hadoop consists of HDFS for storage and MapReduce for processing. Hadoop has been expanded with additional projects including YARN for job scheduling and resource management, Pig and Hive for SQL-like queries, HBase for column-oriented storage, Zookeeper for coordination, and Ambari for provisioning and managing Hadoop clusters. Hadoop provides scalable and cost-effective solutions for storing and analyzing massive amounts of data.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Keynote: Getting Serious about MySQL and Hadoop at Continuent (Continuent)
Lean, mean MySQL and hulking Hadoop clusters may seem like an odd couple, but tying them together is now priority #1 for many MySQL users. This keynote talk, held on the first day of the Percona Live MySQL Conference & Expo 2014, explores the data management trends spurring integration, how the MySQL community is stepping up, and where the integration may go in the future. Robert Hodges, CEO at Continuent, outlines how work at Continuent fits into this picture and how Continuent is contributing to the MySQL community's response to Hadoop.
This talk gives an introduction into Hadoop 2 and YARN. Then the changes for MapReduce 2 are explained. Finally Tez and Spark are explained and compared in detail.
The talk has been held on the Parallel 2014 conference in Karlsruhe, Germany on 06.05.2014.
Agenda:
- Introduction to Hadoop 2
- MapReduce 2
- Tez, Hive & Stinger Initiative
- Spark
This document discusses modern data architecture and Apache Hadoop's role within it. It presents WANdisco and its Non-Stop Hadoop solution, which extends HDFS across multiple data centers to provide 100% uptime for Hadoop deployments. Non-Stop Hadoop uses WANdisco's patented distributed coordination engine to synchronize HDFS metadata across sites separated by wide area networks, enabling continuous availability of HDFS data and global HDFS deployments.
This document discusses loading data from Hadoop into Oracle databases using Oracle connectors. It describes how the Oracle Loader for Hadoop and Oracle SQL Connector for HDFS can load data from HDFS into Oracle tables much faster than traditional methods like Sqoop by leveraging parallel processing in Hadoop. The connectors optimize the loading process by automatically partitioning, sorting, and formatting the data into Oracle blocks to achieve high performance loads. Measuring the CPU time needed per gigabyte loaded allows estimating how long full loads will take based on available resources.
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and additional tools like Hive, Pig, HBase, Zookeeper, Flume, Sqoop and Oozie that make up its ecosystem. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Processing (inside-BigData.com)
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Processing.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video: https://youtu.be/iLTYkTandEA
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document summarizes a technical seminar on Hadoop. It discusses Hadoop's history and origin, how it was developed from Google's distributed systems, and how it provides an open-source framework for distributed storage and processing of large datasets. It also summarizes key aspects of Hadoop including HDFS, MapReduce, HBase, Pig, Hive and YARN, and how they address challenges of big data analytics. The seminar provides an overview of Hadoop's architecture and ecosystem and how it can effectively process large datasets measured in petabytes.
This document summarizes a presentation about new features in Apache Hadoop 3.0 related to YARN and MapReduce. It discusses major evolutions like the re-architecture of the YARN Timeline Service (ATS) to address scalability, usability, and reliability limitations. Other evolutions mentioned include improved support for long-running native services in YARN, simplified REST APIs, service discovery via DNS, scheduling enhancements, and making YARN more cloud-friendly with features like dynamic resource configuration and container resizing. The presentation estimates the timeline for Apache Hadoop 3.0 releases with alpha, beta, and general availability targeted throughout 2017.
The slides are created for the "Hadoop User Group Vienna", a Meetup that gathers Hadoop users in Vienna on September 6, 2017. The content of the slides correspond to the first talk, which discussed the concepts, terminology and disaster recovery capabilities in the Hadoop ecosystem.
Deep learning has become widespread as frameworks such as TensorFlow and PyTorch have made it easy to onboard machine learning applications. However, while it is easy to start developing with these frameworks on your local developer machine, scaling up a model to run on a cluster and train on huge datasets is still challenging. Code and dependencies have to be copied to every machine and defining the cluster configurations is tedious and error-prone. In addition, troubleshooting errors and aggregating logs is difficult. Ad-hoc solutions also lack resource guarantees, isolation from other jobs, and fault tolerance.
To solve these problems and make scaling deep learning easy, we have made several enhancements to Hadoop and built an open-source deep learning platform called TonY. In this talk, Anthony and Keqiu will discuss new Hadoop features useful for deep learning, such as GPU resource support, and deep dive into TonY, which lets you run deep learning programs natively on Hadoop. We will discuss TonY's architecture and how it allows users to manage their deep learning jobs, acting as a portal from which to launch notebooks, monitor jobs, and visualize training results.
From: DataWorks Summit Munich 2017 - 20170406
While you might be tempted to assume that data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance concerns such as backup and disaster recovery (BDR) are not (yet) important. This talk first introduces the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
Hadoop Operations - Best Practices from the Field (DataWorks Summit)
This document discusses best practices for Hadoop operations based on analysis of support cases. Key learnings include using HDFS ACLs and snapshots to prevent accidental data deletion and improve recoverability. HDFS improvements like pausing block deletion and adding diagnostics help address incidents around namespace mismatches and upgrade failures. Proper configuration of hardware, JVM settings, and monitoring is also emphasized.
This document proposes a design for tiered storage in HDFS that allows data to be stored in heterogeneous storage tiers including an external storage system. It describes challenges in synchronizing metadata and data across clusters and proposes using HDFS to coordinate an external storage system in a transparent way to users. The "PROVIDED" storage type would allow blocks to be retrieved directly from the external store via aliases, handling data consistency and security while leveraging HDFS features like quotas and replication policies. Implementation would start with read-only support and progress to full read-write capabilities.
Top Hadoop Big Data Interview Questions and Answers for Freshers (JanBask Training)
This document provides information about Hadoop and its components. It discusses the history of Hadoop and how it has evolved over time. It describes key Hadoop components including HDFS, MapReduce, YARN, and HBase. HDFS is the distributed file system of Hadoop that stores and manages large datasets across clusters. MapReduce is a programming model used for processing large datasets in parallel. YARN is the cluster resource manager that allocates resources to applications. HBase is the Hadoop database that provides real-time random data access.
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop uses HDFS for scalable, fault-tolerant storage and MapReduce for parallel processing. The core components of Hadoop - HDFS and MapReduce - allow for distributed processing of large datasets across commodity hardware, providing capabilities for scalability, cost-effectiveness, and efficient distributed computing.
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins from Google and Apache projects
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Yahoo, Facebook, and Amazon use Hadoop for applications like log processing, searches, and advertisement targeting
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins at Google and Yahoo
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Facebook and Amazon use Hadoop to process massive amounts of data
Hadoop Distributed File System - complete information (bhargavi804095)
The document provides an overview of the Hadoop Distributed File System (HDFS). It discusses that HDFS is the storage unit of Hadoop and relies on distributed file system principles. It has a master-slave architecture with the NameNode as the master and DataNodes as slaves. HDFS allows files to be broken into blocks which are replicated across DataNodes for fault tolerance. The document outlines the key components of HDFS and how read and write operations work in HDFS.
The document discusses the Hadoop and MapReduce architecture. It provides an overview of key components of Hadoop including HDFS, YARN, MapReduce, Pig, Hive, and Spark. It describes how HDFS stores and manages large datasets across clusters and how MapReduce allows distributed processing of large datasets through mapping and reducing functions. The document also provides examples of how MapReduce can be used to analyze large datasets like tweets processed by Twitter.
* The file size is 1664MB
* HDFS block size is usually 128MB by default in Hadoop 2.0
* To calculate number of blocks required: File size / Block size
* 1664MB / 128MB = 13 blocks
* 8 blocks have been uploaded successfully
* So remaining blocks = Total blocks - Uploaded blocks = 13 - 8 = 5
If another client tries to access/read the data while the upload is still in progress, it will only be able to access the data from the 8 blocks that have been uploaded so far. The remaining 5 blocks of data will not be available or visible to other clients until the full upload is completed. HDFS follows write-once semantics, so partially written blocks are not exposed to other clients until the write completes and the file is closed.
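The same arithmetic as a minimal sketch in Java (a hypothetical helper for illustration, not a Hadoop API):

    // Ceiling division: a file that is not an exact multiple of the block
    // size still occupies one extra, partially filled block.
    public class BlockMath {
        static long blocksNeeded(long fileSizeMb, long blockSizeMb) {
            return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
        }

        public static void main(String[] args) {
            long totalBlocks = blocksNeeded(1664, 128);   // 1664 MB / 128 MB = 13 blocks
            long uploadedBlocks = 8;
            System.out.println("Total blocks:     " + totalBlocks);
            System.out.println("Remaining blocks: " + (totalBlocks - uploadedBlocks)); // 5
        }
    }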
We provide Hadoop training in Hyderabad and Bangalore, including corporate training delivered by faculty with 12+ years of experience.
- Real-time industry experts from MNCs
- Resume preparation by expert professionals
- Lab exercises
- Interview preparation
- Expert advice
http://www.learntek.org/product/big-data-and-hadoop/
http://www.learntek.org
Learntek is a global online training provider for Big Data Analytics, Hadoop, Machine Learning, Deep Learning, IoT, AI, Cloud Technology, DevOPS, Digital Marketing and other IT and management courses. We are dedicated to designing, developing and implementing training programs for students, corporate employees and business professionals.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It has several core components including HDFS for distributed file storage and MapReduce for distributed processing. HDFS stores data across clusters of machines with replication for fault tolerance. MapReduce allows parallel processing of large datasets in a distributed manner. Hadoop was designed with goals of using commodity hardware, easy recovery from failures, large distributed file systems, and fast processing of large datasets.
The current major release, Hadoop 2.0, offers several significant HDFS improvements including the new append-pipeline, federation, wire compatibility, NameNode HA, snapshots, and performance improvements. We describe how to take advantage of these new features and their benefits. We cover some architectural improvements in detail, such as HA, federation and snapshots. The second half of the talk describes the current features that are under development for the next HDFS release. This includes much-needed data management features such as backup and disaster recovery. We add support for different classes of storage devices such as SSDs and open interfaces such as NFS; together these extend HDFS as a more general storage system. Hadoop has recently been extended to run first-class on Windows, which expands its enterprise reach and allows integration with the rich tool-set available on Windows. As with every release, we will continue improvements to performance, diagnosability and manageability of HDFS. To conclude, we discuss reliability, the state of HDFS adoption, and some of the misconceptions and myths about HDFS.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It has four main modules - Hadoop Common, HDFS, YARN and MapReduce. HDFS provides a distributed file system that stores data reliably across commodity hardware. MapReduce is a programming model used to process large amounts of data in parallel. Hadoop architecture uses a master-slave model, with a NameNode master and DataNode slaves. It provides fault tolerance, high throughput access to application data and scales to thousands of machines.
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2 (tcloudcomputing-tw)
The presentation is designed for those interested in Hadoop technology and covers the community history, current development status, service features, the distributed computing framework, and scenarios for big data adoption in the enterprise.
Hadoop is an open source software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce as a programming model and HDFS for storage. Hadoop supports various big data applications like HBase for distributed column storage, Hive for data warehousing and querying, Pig and Jaql for data flow languages, and Hadoop ecosystem projects for tasks like system monitoring and machine learning.
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
The document provides an introduction to big data and Hadoop. It defines big data as large datasets that are difficult to process using traditional software tools due to their size and complexity. It describes the characteristics of big data using the original 3Vs model (volume, velocity, variety) as well as additional attributes. The text then explains the architecture and components of Hadoop, the open-source framework for distributed storage and processing of big data, including HDFS, MapReduce, and other related tools. It provides an overview of how Hadoop addresses the challenges of big data through scalable and fault-tolerant distributed processing of data across commodity hardware.
Covers the fundamentals of Big Data, Hadoop project design, and a case study / use case.
It walks through general planning considerations and the essentials of the Hadoop ecosystem and Hadoop projects.
This provides the basis for choosing the right Hadoop implementation, integrating and adopting Hadoop technologies, and creating the infrastructure.
Building applications using Apache Hadoop is illustrated with a real-life use case of Wi-Fi log analysis.
Computer organization and assembly language: covers types of programming languages, along with variables and arrays. https://www.nfciet.edu.pk/
Mieke Jans is a Manager at Deloitte Analytics Belgium. She learned about process mining from her PhD supervisor while she was collaborating with a large SAP-using company for her dissertation.
Mieke extended her research topic to investigate the data availability of process mining data in SAP and the new analysis possibilities that emerge from it. It took her 8-9 months to find the right data and prepare it for her process mining analysis. She needed insights from both process owners and IT experts. For example, one person knew exactly how the procurement process took place at the front end of SAP, and another person helped her with the structure of the SAP-tables. She then combined the knowledge of these different persons.
Telangana State, India’s newest state that was carved from the erstwhile state of Andhra Pradesh in 2014, has launched the Water Grid Scheme named ‘Mission Bhagiratha (MB)’ to seek a permanent and sustainable solution to the drinking water problem in the state. MB is designed to provide potable drinking water to every household in their premises through piped water supply (PWS) by 2018. The vision of the project is to ensure safe and sustainable piped drinking water supply from surface water sources.
19. Hadoop Distributed File System
Data Model:
• Data is organized into files and directories
• Files are divided into uniformly-sized blocks and distributed across cluster nodes
• Blocks are replicated to handle hardware failure
• Filesystem keeps checksums of data for corruption detection and recovery
• Read requests are always served from closest replica
• Not strictly POSIX-compliant
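As an illustration of this data model, a minimal sketch using the standard Hadoop FileSystem API to inspect a file's length, block size, replication factor and block locations (the path /data/events.log is a made-up example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/events.log");      // hypothetical example path

            FileStatus status = fs.getFileStatus(file);
            System.out.println("length=" + status.getLen()
                    + " blockSize=" + status.getBlockSize()
                    + " replication=" + status.getReplication());

            // Each BlockLocation lists the DataNodes holding a replica of that block;
            // clients use this information to read from the closest replica.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + loc.getOffset()
                        + " length=" + loc.getLength()
                        + " hosts=" + String.join(",", loc.getHosts()));
            }
            fs.close();
        }
    }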
32. Map-Reduce Programming Model
• Programming model for processing lists of key/value pairs
• Map function: processes input key/value pairs and produces a set of intermediate key/value pairs.
• Reduce function: merges all intermediate values associated with the same intermediate key and produces output key/value pairs.
• Data flow: Input (K1, V1) → Map → Intermediate Output List(K2, V2) → Sort/Group by K2 → (K2, List(V2)) → Reduce → Output (K2, List(V3))
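To make the model concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API (the class names are illustrative, not taken from the slides):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: (offset, line) -> (word, 1) for every word in the line.
    class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count).
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }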
33. Application Writer Specifies:
• Map and Reduce classes
• Input data on HDFS
• Input/Output format classes (optional)
Workflow:
• Input phase generates a number of logical FileSplits from input files
• One Map task is created per logical file split
• Each Map task loads the Map class and executes the map function to transform input k-v pairs into a new set of k-v pairs
• A record reader class, supplied as part of the InputFormat, reads an input record as a k-v pair
• Map output keys are stored on local disk in sorted partitions, one per task
• One invocation of the map function per k-v pair from an associated input split
• Each Reduce task fetches map output (from its associated partition) as soon as a map task finishes its processing
• Map outputs are merged
• One invocation of the reduce function per distinct key and its associated list of values
• Output k-v pairs are stored on HDFS, one file per reduce task
• Framework handles task scheduling and recovery.
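A minimal driver sketch of what the application writer specifies, reusing the TokenizerMapper and SumReducer classes from the word-count sketch above (the HDFS paths are illustrative assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            // The application writer supplies the Map and Reduce classes...
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // ...optional input/output format classes (TextInputFormat is the default)...
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            // ...and the input/output locations on HDFS.
            FileInputFormat.addInputPath(job, new Path("/data/input"));
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));

            // The framework handles splitting, scheduling, shuffle and recovery.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }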
[Figure: Parallel Execution Model for Map-Reduce - the input HDFS file is divided into Input Splits 0, 1 and 2; Map 0, Map 1 and Map 2 each write sorted partitions of intermediate keys (K1..m and Km+1..N); the shuffle delivers each partition to its reducer, where it is merged and sorted; Reduce 0 and Reduce 1 produce the output files Part-0 and Part-1.]