Hadoop - Introduction to MapReduce
Hadoop MapReduce
Introduction
Terminology
Google calls it:   Hadoop equivalent:
MapReduce          Hadoop MapReduce
GFS                HDFS
Bigtable           HBase
Chubby             ZooKeeper
Some MapReduce Terminology
 Job – A “full program” - an execution of a Mapper and
Reducer across a data set
 Task – An execution of a Mapper or a Reducer on a slice
of data
 a.k.a. Task-In-Progress (TIP)
 Task Attempt – A particular instance of an attempt to
execute a task on a machine
Task Attempts
 A particular task will be attempted at least once,
possibly more times if it crashes
 If the same input causes crashes over and over, that input
will eventually be abandoned
 Multiple attempts at one task may occur in parallel
with speculative execution turned on
 Task ID from TaskInProgress is not a unique identifier;
don’t use it that way
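Speculative execution can be toggled per job. A minimal sketch of disabling it, assuming the Hadoop 0.20 mapred API used throughout these slides (the helper class name is illustrative):
import org.apache.hadoop.mapred.JobConf;
public class SpeculationConfig {
// Sketch: give every task a single attempt at a time by turning off
// speculative (duplicate) attempts for both phases.
public static void disableSpeculation(JobConf conf) {
conf.setMapSpeculativeExecution(false);
conf.setReduceSpeculativeExecution(false);
}
}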
MapReduce: High Level
In our case: circe.rc.usf.edu
Nodes, Trackers, Tasks
 Master node runs JobTracker instance, which accepts
Job requests from clients
 TaskTracker instances run on slave nodes
 TaskTracker forks separate Java process for task
instances
Job Distribution
 MapReduce programs are contained in a Java “jar”
file + an XML file containing serialized program
configuration options
 Running a MapReduce job places these files into
the HDFS and notifies TaskTrackers where to
retrieve the relevant program code
 … Where’s the data distribution?
Data Distribution
 Implicit in design of MapReduce!
 All mappers are equivalent, so each maps whatever data
is local to its node in HDFS
 If lots of data does happen to pile up on the
same node, nearby nodes will map instead
 Data transfer is handled implicitly by HDFS
What Happens In Hadoop?
Depth First
Job Launch Process: Client
 Client program creates a JobConf
 Identify classes implementing Mapper and
Reducer interfaces
 JobConf.setMapperClass(), setReducerClass()
 Specify inputs, outputs
 FileInputFormat.setInputPaths(),
 FileOutputFormat.setOutputPath()
 Optionally, other options too:
 JobConf.setNumReduceTasks(),
JobConf.setOutputFormat()…
Job Launch Process: JobClient
 Pass JobConf to JobClient.runJob() or
submitJob()
 runJob() blocks, submitJob() does not
 JobClient:
 Determines proper division of input into InputSplits
 Sends job data to master JobTracker server
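A sketch of the two submission styles, again assuming the 0.20 mapred API (the polling interval is arbitrary):
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
public class SubmitStyles {
public static void run(JobConf conf) throws Exception {
// Blocking: returns (or throws) only once the job finishes.
// JobClient.runJob(conf);
// Non-blocking: submit, then poll the RunningJob handle.
JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);
while (!job.isComplete()) {
System.out.printf("map %.0f%% reduce %.0f%%%n",
job.mapProgress() * 100, job.reduceProgress() * 100);
Thread.sleep(5000);
}
}
}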
Job Launch Process: JobTracker
 JobTracker:
 Inserts jar and JobConf (serialized to XML) in
shared location
 Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker
 TaskTrackers running on slave nodes
periodically query JobTracker for work
 Retrieve job-specific jar and config
 Launch task in separate instance of Java
 main() is provided by Hadoop
Job Launch Process: Task
 TaskTracker.Child.main():
 Sets up the child TaskInProgress attempt
 Reads XML configuration
 Connects back to necessary MapReduce
components via RPC
 Uses TaskRunner to launch user process
Job Launch Process: TaskRunner
 TaskRunner, MapTaskRunner, MapRunner
work in a daisy-chain to launch your Mapper
 Task knows ahead of time which InputSplits it
should be mapping
 Calls Mapper once for each record retrieved from
the InputSplit
 Running the Reducer is much the same
Creating the Mapper
 You provide the instance of Mapper
 Should extend MapReduceBase
 One instance of your Mapper is initialized by
the MapTaskRunner for a TaskInProgress
 Exists in separate process from all other instances
of Mapper – no data sharing!
Mapper
 void map(K1 key,
V1 value,
OutputCollector<K2, V2> output,
Reporter reporter)
 K types implement WritableComparable
 V types implement Writable
What is Writable?
 Hadoop defines its own “box” classes for
strings (Text), integers (IntWritable), etc.
 All values are instances of Writable
 All keys are instances of WritableComparable
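Beyond the built-in box classes, you can define your own. A sketch of a custom key type (a hypothetical class, not part of Hadoop):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
// Sketch: an int pair usable as a key (so it must be comparable).
public class IntPairWritable implements WritableComparable<IntPairWritable> {
private int first, second;
public void set(int f, int s) { first = f; second = s; }
public void write(DataOutput out) throws IOException {
out.writeInt(first);
out.writeInt(second);
}
public void readFields(DataInput in) throws IOException {
first = in.readInt();
second = in.readInt();
}
public int compareTo(IntPairWritable o) {
if (first != o.first) return first < o.first ? -1 : 1;
if (second != o.second) return second < o.second ? -1 : 1;
return 0;
}
public int hashCode() { return 31 * first + second; } // HashPartitioner uses this
public boolean equals(Object o) {
return o instanceof IntPairWritable
&& ((IntPairWritable) o).first == first
&& ((IntPairWritable) o).second == second;
}
}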
Reading Data
 Data sets are specified by InputFormats
 Defines input data (e.g., a directory)
 Identifies partitions of the data that form an
InputSplit
 Factory for RecordReader objects to extract (k, v)
records from the input source
FileInputFormat and Friends
 TextInputFormat – Treats each '\n'-terminated line of a
file as a value
 KeyValueTextInputFormat – Maps '\n'-terminated text
lines of “k SEP v”
 SequenceFileInputFormat – Binary file of (k, v) pairs with
some add’l metadata
 SequenceFileAsTextInputFormat – Same, but maps
(k.toString(), v.toString())
Filtering File Inputs
 FileInputFormat will read all files out of a
specified directory and send them to the
mapper
 Delegates filtering this file list to a method
subclasses may override
 e.g., Create your own “xyzFileInputFormat” to
read *.xyz from directory list
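One way to get the “xyzFileInputFormat” effect without overriding the listing method itself is the PathFilter hook that FileInputFormat consults; a sketch (the filter class name is hypothetical):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
// Sketch: accept only *.xyz files from the input directory list.
public class XyzPathFilter implements PathFilter {
public boolean accept(Path path) {
return path.getName().endsWith(".xyz");
}
}
// Registered on the job with, e.g.:
// FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);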
Record Readers
 Each InputFormat provides its own
RecordReader implementation
 Provides (unused?) capability multiplexing
 LineRecordReader – Reads a line from a text
file
 KeyValueRecordReader – Used by
KeyValueTextInputFormat
Input Split Size
 FileInputFormat will divide large files into
chunks
 Exact size controlled by mapred.min.split.size
 RecordReaders receive file, offset, and
length of chunk
 Custom InputFormat implementations may
override split size – e.g., “NeverChunkFile”
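A sketch of the “NeverChunkFile” idea: subclass an existing format and refuse to split, so each file becomes exactly one InputSplit (the class name comes from the slide; the implementation is an assumed minimal version):
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;
public class NeverChunkFile extends TextInputFormat {
// One split per file: a whole file always goes to a single mapper.
@Override
protected boolean isSplitable(FileSystem fs, Path file) {
return false;
}
}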
Sending Data To Reducers
 Map function receives OutputCollector object
 OutputCollector.collect() takes (k, v) elements
 Any (WritableComparable, Writable) can be
used
 By default, mapper output type assumed to
be same as reducer output type
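When the mapper's output types do differ from the reducer's final output, declare the intermediate types explicitly on the JobConf; a minimal sketch (types chosen for illustration):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
public class OutputTypes {
public static void configure(JobConf conf) {
// Intermediate (map-side) types:
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
// Final (reduce-side) types:
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
}
}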
WritableComparator
 Compares WritableComparable data
 Will call WritableComparable.compare()
 Can provide fast path for serialized data
 JobConf.setOutputValueGroupingComparator()
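A sketch of the fast path for IntWritable keys, comparing serialized bytes directly instead of deserializing; for sort order it would be registered with JobConf.setOutputKeyComparatorClass(), while the grouping comparator above is set the same way:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;
public class FastIntComparator extends WritableComparator {
public FastIntComparator() {
super(IntWritable.class);
}
// Fast path: IntWritable serializes as 4 big-endian bytes,
// so keys can be ordered without deserializing them.
@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
int a = readInt(b1, s1);
int b = readInt(b2, s2);
return a < b ? -1 : (a == b ? 0 : 1);
}
}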
Sending Data To The Client
 Reporter object sent to Mapper allows simple
asynchronous feedback
 incrCounter(Enum key, long amount)
 setStatus(String msg)
 Allows self-identification of input
 InputSplit getInputSplit()
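A sketch of both feedback calls inside a mapper (the counter enum and status text are illustrative, not part of any API):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class CountingMapper extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
enum MyCounters { EMPTY_LINES } // hypothetical counter
public void map(LongWritable key, Text value,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
if (value.toString().trim().isEmpty()) {
reporter.incrCounter(MyCounters.EMPTY_LINES, 1); // asynchronous feedback
reporter.setStatus("skipping empty line");
}
// reporter.getInputSplit() would identify the split being processed
// ... normal map work here ...
}
}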
Partitioner
 int getPartition(key, val, numPartitions)
 Outputs the partition number for a given key
 One partition == values sent to one Reduce task
 HashPartitioner used by default
 Uses key.hashCode() to return partition num
 JobConf sets Partitioner implementation
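A sketch of a custom Partitioner (the routing policy is invented for illustration); it is installed with JobConf.setPartitionerClass():
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
public void configure(JobConf job) { } // no setup needed
// Route each key by its first character, so one reduce task
// sees a contiguous slice of the alphabet.
public int getPartition(Text key, IntWritable value, int numPartitions) {
if (key.getLength() == 0) return 0;
return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
}
}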
Reduction
 reduce( K2 key,
Iterator<V2> values,
OutputCollector<K3, V3> output,
Reporter reporter )
 Keys & values sent to one partition all go to
the same reduce task
 Calls are sorted by key – “earlier” keys are
reduced and output before “later” keys
Finally: Writing The Output
OutputFormat
 Analogous to InputFormat
 TextOutputFormat – Writes “key \t val \n” strings
to output file
 SequenceFileOutputFormat – Uses a binary
format to pack (k, v) pairs
 NullOutputFormat – Discards output
 Only useful if defining own output methods within
reduce()
Example Program - Wordcount
 map()
 Receives a chunk of text
 Outputs a set of word/count pairs
 reduce()
 Receives a key and all its associated values
 Outputs the key and the sum of the values
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
Wordcount – main( )
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
Wordcount – map( )
public static class Map extends MapReduceBase … {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, …) … {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
Wordcount – reduce( )
public static class Reduce extends MapReduceBase … {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, …) … {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
}
Hadoop Streaming
 Allows you to create and run map/reduce
jobs with any executable
 Similar to unix pipes, e.g.:
 format is: Input | Mapper | Reducer
 echo “this sentence has five lines” | cat | wc
Hadoop Streaming
 Mapper and Reducer receive data from stdin and output to
stdout
 Hadoop takes care of the transmission of data between the
map/reduce tasks
 It is still the programmer’s responsibility to set the correct
key/value
 Default format: “key \t value \n”
 Let’s look at a Python example of a MapReduce word count
program…
Streaming_Mapper.py
import sys

# read in one line of input at a time from stdin
for line in sys.stdin:
    line = line.strip()   # string
    words = line.split()  # list of strings
    # write data on stdout
    for word in words:
        print '%s\t%i' % (word, 1)
Hadoop Streaming
 What are we outputting?
 Example output: “the 1”
 By default, “the” is the key, and “1” is the value
 Hadoop Streaming handles delivering this key/value pair to a
Reducer
 Able to send similar keys to the same Reducer or to an
intermediary Combiner
Streaming_Reducer.py
import sys

wordcount = {}  # empty dictionary

# read in one line of input at a time from stdin
for line in sys.stdin:
    line = line.strip()        # string
    key, value = line.split()  # both parts arrive as strings
    wordcount[key] = wordcount.get(key, 0) + int(value)

# write data on stdout
for word, count in sorted(wordcount.items()):
    print '%s\t%i' % (word, count)
Hadoop Streaming Gotcha
 Streaming Reducer receives single lines
(which are key/value pairs) from stdin
 Regular Reducer receives a collection of all the
values for a particular key
 It is still the case that all the values for a particular
key will go to a single Reducer
Using Hadoop Distributed File System
(HDFS)
 Can access HDFS through various shell
commands (see Further Resources slide for
link to documentation)
 hadoop fs -put <localsrc> … <dst>
 hadoop fs -get <src> <localdst>
 hadoop fs -ls
 hadoop fs -rm file
Configuring Number of Tasks
 Normal method
 jobConf.setNumMapTasks(400)
 jobConf.setNumReduceTasks(4)
 Hadoop Streaming method
 -jobconf mapred.map.tasks=400
 -jobconf mapred.reduce.tasks=4
 Note: # of map tasks is only a hint to the framework. Actual
number depends on the number of InputSplits generated
Running a Hadoop Job
 Place input file into HDFS:
 hadoop fs -put ./input-file input-file
 Run either normal or streaming version:
 hadoop jar Wordcount.jar org.myorg.WordCount input-file
output-file
 hadoop jar hadoop-streaming.jar \
-input input-file \
-output output-file \
-file Streaming_Mapper.py \
-mapper "python Streaming_Mapper.py" \
-file Streaming_Reducer.py \
-reducer "python Streaming_Reducer.py"
Submitting to RC’s GridEngine
 Add appropriate modules
 module add apps/jdk/1.6.0_22.x86_64 apps/hadoop/0.20.2
 Use the submit script posted in the Further Resources slide
 Script calls internal functions hadoop_start and hadoop_end
 Adjust the lines for transferring the input file to HDFS and starting the
hadoop job using the commands on the previous slide
 Adjust the expected runtime (generally good practice to overshoot your
estimate)
 #$ -l h_rt=02:00:00
 NOTICE: “All jobs are required to have a hard run-time specification. Jobs
that do not have this specification will have a default run-time of 10 minutes
and will be stopped at that point.”
Output Parsing
 Output of the reduce tasks must be retrieved:
 hadoop fs -get output-file hadoop-output
 This creates a directory of output files, 1 per reduce task
 Output files numbered part-00000, part-00001, etc.
 Sample output of Wordcount
 head -n5 part-00000
“’tis 1
“come 2
“coming 1
“edwin 1
“found 1
Extra Output
 The stdout/stderr streams of Hadoop itself will be stored in an output file
(whichever one is named in the startup script)
 #$ -o output.$job_id
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = svc-3024-8-10.rc.usf.edu/10.250.4.205
…
11/03/02 18:28:47 INFO mapred.FileInputFormat: Total input paths to process : 1
11/03/02 18:28:47 INFO mapred.JobClient: Running job: job_local_0001
…
11/03/02 18:28:48 INFO mapred.MapTask: numReduceTasks: 1
…
11/03/02 18:28:48 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/03/02 18:28:48 INFO mapred.Merger: Merging 1 sorted segments
11/03/02 18:28:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total
size: 43927 bytes
11/03/02 18:28:48 INFO mapred.JobClient: map 100% reduce 0%
…
Thank You !!!
For more information, follow us at:
http://vibranttechnologies.co.in/hadoop-classes-in-mumbai.html