Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)

© 2013 SpringOne 2GX. All rights reserved. Do not distribute without permission.
Hadoop
Just the Basics for Big Data Rookies
Adam Shook
ashook@gopivotal.com

Agenda
• Hadoop Overview
• HDFS Architecture
• Hadoop MapReduce
• Hadoop Ecosystem
• MapReduce Primer
• Buckle up!

Hadoop Core
• Open-source Apache project out of Yahoo! in 2006
• Distributed fault-tolerant data storage and batch
processing
• Provides linear scalability on commodity hardware
• Adopted by many:
– Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM,
Netflix, Twitter, Yahoo!, and many, many more

Why?
• Bottom line:
– Flexible
– Scalable
– Inexpensive

Overview
• Great at
– Reliable storage for multi-petabyte data sets
– Batch queries and analytics
– Complex hierarchical data structures with changing
schemas, unstructured and structured data
• Not so great at
– Changes to files (can’t do it…)
– Low-latency responses
– Analyst usability
• This is less of a concern now due to higher-level languages

Data Structure
• Bytes!
• No more ETL necessary
• Store data now, process later
• Structure on read
– Built-in support for common data types and formats
– Extendable
– Flexible

Versioning
• Version 0.20.x, 0.21.x, 0.22.x, 1.x.x
– Two main MR packages:
• org.apache.hadoop.mapred (deprecated)
• org.apache.hadoop.mapreduce (new hotness)
• Version 2.x.x, alpha’d in May 2012
– NameNode HA
– YARN – Next Gen MapReduce

HDFS Overview
• Hierarchical UNIX-like file system for data storage
– sort of
• Splitting of large files into blocks
• Distribution and replication of blocks to nodes
• Two key services
– Master NameNode
– Many DataNodes
• Checkpoint Node (Secondary NameNode)

NameNode
• Single master service for HDFS
• Single point of failure (HDFS 1.x)
• Stores file to block to location mappings in the namespace
• All transactions are logged to disk
• NameNode startup reads namespace image and logs

Checkpoint Node (Secondary NN)
• Performs checkpoints of the NameNode’s namespace and
logs
• Not a hot backup!
1. Loads up namespace
2. Reads log transactions to modify namespace
3. Saves namespace as a checkpoint

DataNode
• Stores blocks on local disk
• Sends frequent heartbeats to NameNode
• Sends block reports to NameNode
• Clients connect to DataNode for I/O

How HDFS Works - Writes
DataNode A DataNode B DataNode C DataNode D
NameNode
1
Client
2
A1
3
A2 A3 A4
Client contacts NameNode to write data
NameNode says write it to these nodes
Client sequentially
writes blocks to DataNode

How HDFS Works - Writes
NameNodeClient
A1 A2 A3 A4 A1A1 A2A2
A3A3A4 A4
DataNodes replicate data
blocks, orchestrated
by the NameNode

How HDFS Works - Reads
NameNodeClient
A1 A2 A3 A4 A1A1 A2A2
A3A3A4 A4
1
2
3
Client contacts NameNode to read data
NameNode says you can find it here
Client sequentially
reads blocks from DataNode

NameNodeClient
A1 A2 A3 A4 A1A1 A2A2
A3A3A4 A4
Client connects to another
node serving that block
How HDFS Works - Failure

Block Replication
• Default of three replicas
• Rack-aware system
– One block on same rack
– One block on same rack,
different host
– One block on another rack
• Automatic re-copy by
NameNode, as needed
Rack 1
DN
DN
DN
…
Rack 2
DN
DN
DN
…

HDFS 2.0 Features
• NameNode High-Availability (HA)
– Two redundant NameNodes in active/passive configuration
– Manual or automated failover
• NameNode Federation
– Multiple independent NameNodes using the same collection
of DataNodes

Hadoop MapReduce 1.x
• Moves the code to the data
• JobTracker
– Master service to monitor jobs
• TaskTracker
– Multiple services to run tasks
– Same physical machine as a DataNode
• A job contains many tasks
• A task contains one or more task attempts

JobTracker
• Monitors job and task progress
• Issues task attempts to TaskTrackers
• Re-tries failed task attempts
• Four failed attempts = one failed job
• Schedules jobs in FIFO order
– Fair Scheduler
• Single point of failure for MapReduce

TaskTrackers
• Runs on same node as DataNode service
• Sends heartbeats and task reports to JobTracker
• Configurable number of map and reduce slots
• Runs map and reduce task attempts
– Separate JVM!

Exploiting Data Locality
• JobTracker will schedule task on a TaskTracker that is
local to the block
– 3 options!
• If TaskTracker is busy, selects TaskTracker on same rack
– Many options!
• If still busy, chooses an available TaskTracker at random
– Rare!

How MapReduce Works
DataNode A
A1 A2 A4 A2 A1 A3 A3 A2 A4 A4 A1 A3
JobTracker
1
Client
4
2
B1 B3 B4 B2 B3 B1 B3 B2 B4 B4 B1 B2
3
DataNode B DataNode C DataNode D
TaskTracker A TaskTracker B TaskTracker C TaskTracker D
Client submits job to JobTracker
JobTracker submits
tasks to TaskTrackers
Job output is written to
DataNodes w/replication
JobTracker reports metrics

DataNode A
A1 A2 A4 A2 A1 A3 A3 A2 A4 A4 A1 A3
JobTrackerClient
B1 B3 B4 B2 B3 B1 B3 B2 B4 B4 B1 B2
DataNode B DataNode C DataNode D
TaskTracker A TaskTracker B TaskTracker C TaskTracker D
How MapReduce Works - Failure
JobTracker assigns task to different node

YARN
• Abstract framework for distributed application
development
• Split functionality of JobTracker into two components
– ResourceManager
– ApplicationMaster
• TaskTracker becomes NodeManager
– Containers instead of map and reduce slots
• Configurable amount of memory per NodeManager

MapReduce 2.x on YARN
• MapReduce API has not changed
– Rebuild required to upgrade from 1.x to 2.x
• Application Master launches and monitors job via YARN
• MapReduce History Server to store… history

Hadoop Ecosystem
• Core Technologies
– Hadoop Distributed File System
– Hadoop MapReduce
• Many other tools…
– Which I will be describing… now

Moving Data
• Sqoop
– Moving data between RDBMS and HDFS
– Say, migrating MySQL tables to HDFS
• Flume
– Streams event data from sources to sinks
– Say, weblogs from multiple servers into HDFS

Higher Level APIs
• Pig
– Data-flow language – aptly named PigLatin -- to generate
one or more MapReduce jobs against data stored locally or
in HDFS
• Hive
– Data warehousing solution, allowing users to write SQL-like
queries to generate a series of MapReduce jobs against
data stored in HDFS

Pig Word Count
A = LOAD '$input';
B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group AS word, COUNT(B);
STORE D INTO '$output';

Key/Value Stores
• HBase
• Accumulo
• Implementations of Google’s Big Table for HDFS
• Provides random, real-time access to big data
• Supports updates and deletes of key/value pairs

HBase Architecture
MasterZooKeeper
RegionServer
Region
Store
StoreFile
MemStore
StoreFile
Store
StoreFile
MemStore
StoreFile
Client
HDFS
RegionServer
Region
Store
StoreFile
MemStore
StoreFile
Store
StoreFile
MemStore
StoreFile

Data Structure
• Avro
– Data serialization system designed for the Hadoop
ecosystem
– Expressed as JSON
• Parquet
– Compressed, efficient columnar storage for Hadoop and
other systems

Scalable Machine Learning
• Mahout
– Library for scalable machine learning written in Java
– Very robust examples!
– Classification, Clustering, Pattern Mining, Collaborative
Filtering, and much more

Workflow Management
• Oozie
– Scheduling system for Hadoop Jobs
– Support for:
• Java MapReduce
• Streaming MapReduce
• Pig, Hive, Sqoop, Distcp
• Any ol’ Java or shell script program

Real-time Stream Processing
• Storm
– Open-source project
which runs a streaming
of data, called a spout,
to a series of execution
agents called bolts
– Scalable and fault-
tolerant, with guaranteed
processing of data
– Benchmarks of over a
million tuples processed
per second per node

Distributed Application Coordination
• ZooKeeper
– An effort to develop and
maintain an open-source
server which enables
highly reliable distributed
coordination
– Designed to be simple,
replicated, ordered, and
fast
– Provides configuration
management, distributed
synchronization, and group
services for applications

Hadoop Streaming
• Write MapReduce mappers and reducers using stdin and
stdout
• Execute on command line using Hadoop Streaming JAR
// TODO verify
hadoop jar hadoop-streaming.jar -input input -output outputdir
-mapper org.apache.hadoop.mapreduce.Mapper -reduce /bin/wc

SQL on Hadoop
• Apache Drill
• Cloudera Impala
• Hive Stinger
• Pivotal HAWQ
• MPP execution of SQL queries against HDFS data

That’s a lot of projects
• I am likely missing several (Sorry, guys!)
• Each cropped up to solve a limitation of Hadoop Core
• Know your ecosystem
• Pick the right tool for the right job

Sample Architecture
HDFS
Flume
Agent
Flume
Agent
Flume
Agent
MapReduce Pig HBase Storm
Website
Oozie
Webserve
r
Sales
Call Center SQL
SQL

MapReduce Paradigm
• Data processing system with two key phases
• Map
– Perform a map function on input key/value pairs to generate
intermediate key/value pairs
• Reduce
– Perform a reduce function on intermediate key/value groups
to generate output key/value pairs
• Groups created by sorting map output

Reduce Task 0 Reduce Task 1
Map Task 0 Map Task 1 Map Task 2
(0, "hadoop is fun") (52, "I love hadoop") (104, "Pig is more fun")
("hadoop", 1)
("is", 1)
("fun", 1)
("I", 1)
("love", 1)
("hadoop", 1)
("Pig", 1)
("is", 1)
("more", 1)
("fun", 1)
("hadoop", {1,1})
("is", {1,1})
("fun", {1,1})
("love", {1})
("I", {1})
("Pig", {1})
("more", {1})
("hadoop", 2)
("fun", 2)
("love", 1)
("I", 1)
("is", 2)
("Pig", 1)
("more", 1)
SHUFFLE AND SORT
Map Input
Map Output
Reducer Input Groups
Reducer Output

Hadoop MapReduce Components
• Map Phase
– Input Format
– Record Reader
– Mapper
– Combiner
– Partitioner
• Reduce Phase
– Shuffle
– Sort
– Reducer
– Output Format
– Record Writer

Writable Interfaces
public interface Writable {
void write(DataOutput out);
void readFields(DataInput in);
}
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
• BooleanWritable
• BytesWritable
• ByteWritable
• DoubleWritable
• FloatWritable
• IntWritable
• LongWritable
• NullWritable
• Text

InputFormat
public abstract class InputFormat<K, V> {
public abstract List<InputSplit> getSplits(JobContext context);
public abstract RecordReader<K, V>
createRecordReader(InputSplit split, TaskAttemptContext context);
}

RecordReader
public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
public abstract void initialize(InputSplit split, TaskAttemptContext context);
public abstract boolean nextKeyValue();
public abstract KEYIN getCurrentKey();
public abstract VALUEIN getCurrentValue();
public abstract float getProgress();
public abstract void close();
}

Mapper
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
protected void setup(Context context) { /* NOTHING */ }
protected void cleanup(Context context) { /* NOTHING */ }
protected void map(KEYIN key, VALUEIN value, Context context) {
context.write((KEYOUT) key, (VALUEOUT) value);
}
public void run(Context context) {
setup(context);
while (context.nextKeyValue())
map(context.getCurrentKey(), context.getCurrentValue(), context);
cleanup(context);
}
}

Partitioner
public abstract class Partitioner<KEY, VALUE> {
public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
• Default HashPartitioner uses key’s hashCode() % numPartitions

Reducer
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
protected void setup(Context context) { /* NOTHING */ }
protected void cleanup(Context context) { /* NOTHING */ }
protected void reduce(KEYIN key, Iterable<VALUEIN> value, Context context) {
for (VALUEIN value : values)
context.write((KEYOUT) key, (VALUEOUT) value);
}
public void run(Context context) {
setup(context);
while (context.nextKey())
reduce(context.getCurrentKey(), context.getValues(), context);
cleanup(context);
}
}

OutputFormat
public abstract class OutputFormat<K, V> {
public abstract RecordWriter<K, V>
getRecordWriter(TaskAttemptContext context);
public abstract void checkOutputSpecs(JobContext context);
public abstract OutputCommitter
getOutputCommitter(TaskAttemptContext context);
}

RecordWriter
public abstract class RecordWriter<K, V> {
public abstract void write(K key, V value);
public abstract void close(TaskAttemptContext context);
}

Problem
• Count the number of times
each word is used in a
body of text
• Uses TextInputFormat and
TextOutputFormat
map(byte_offset, line)
foreach word in line
emit(word, 1)
reduce(word, counts)
sum = 0
foreach count in counts
sum += count
emit(word, sum)

Mapper Code
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
private final static IntWritable ONE = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, ONE);
}
}
}

Shuffle and Sort
P0 P1 1P2 P3 P0 P1 P2 P3 P0 P1 P2 P3 P0 P1 P2 P3
P0 P0 P0 P0 P1 P1 P1 P1 P2 P2 P2 P2 P3 P3 P3 P3
2
3
P0 P1 P2 P3
Reducer 0 Reducer 1 Reducer 2 Reducer 3
Mapper 0 Mapper 1 Mapper 2 Mapper 3
Mapper outputs
to a single logically
partitioned file
Reducers copy
their parts
Reducer
merges
partitions,
sorting by key

Reducer Code
public class IntSumReducer
extends Reducer<Text, LongWritable, Text, IntWritable> {
private IntWritable outvalue = new IntWritable();
private int sum = 0;
public void reduce(Text key, Iterable<IntWritable> values, Context context) {
sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
outvalue.set(sum);
context.write(key, outvalue);
}
}

So what’s so hard about it?
MapReduce
that’s a tiny box
All the problems you'll
ever have ever

So what’s so hard about it?
• MapReduce is a limitation
• Entirely different way of thinking
• Simple processing operations such as joins are not so
easy when expressed in MapReduce
• Proper implementation is not so easy
• Lots of configuration and implementation details for
optimal performance
– Number of reduce tasks, data skew, JVM size, garbage
collection

So what does this mean for you?
• Hadoop is written primarily in Java
• Components are extendable and configurable
• Custom I/O through Input and Output Formats
– Parse custom data formats
– Read and write using external systems
• Higher-level tools enable rapid development of big data
analysis

Resources, Wrap-up, etc.
• http://hadoop.apache.org
• Very supportive community
• Strata + Hadoop World Oct. 28th – 30th in Manhattan
• Plenty of resources available to learn more
– Blogs
– Email lists
– Books
– Shameless Plug -- MapReduce Design Patterns

Getting Started
• Pivotal HD Single-Node VM and Community Edition
– http://gopivotal.com/pivotal-products/data/pivotal-hd
• For the brave and bold -- Roll-your-own!
– http://hadoop.apache.org/docs/current

Acknowledgements
• Apache Hadoop, the Hadoop elephant logo, HDFS,
Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout, Oozie,
Pig, Sqoop, YARN, and ZooKeeper are trademarks of the
Apache Software Foundation
• Cloudera Impala is a trademark of Cloudera
• Parquet is copyright Twitter, Cloudera, and other
contributors
• Storm is licensed under the Eclipse Public License

Learn More. Stay Connected.
• Talk to us on Twitter: @springcentral
• Find Session replays on YouTube: spring.io/video

Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)

More Related Content

What's hot (20)

Viewers also liked (14)

Similar to Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013) (20)

More from VMware Tanzu (20)

Recently uploaded (20)

Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)

Editor's Notes