Unit 3 MapReduce Part 1
Agenda: MapReduce
1. Data Flow
2. Map
3. Shuffle
4. Sort
5. Reduce
6. Hadoop Streaming
7. mrjob
8. Installation
9. wordcount in mrjob
10. Executing mrjob
What is MapReduce?
History:
MapReduce was developed at Google in 2004 by Jeffrey Dean and Sanjay Ghemawat (Dean & Ghemawat, 2004). The model they described in their paper, "MapReduce: Simplified Data Processing on Large Clusters," was inspired by the map and reduce functions commonly used in functional programming.
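As a quick illustration of that functional-programming inspiration, here is a minimal Python sketch using the built-in map() and functools.reduce(); the sample data and merge helper are made up purely for the example:

from functools import reduce

lines = ["hello world", "hello mapreduce"]  # toy input, purely illustrative

# map: transform each line into a list of (word, 1) pairs
pairs = map(lambda line: [(word, 1) for word in line.split()], lines)

# reduce: merge the per-line pair lists into a single word -> count dictionary
def merge(counts, line_pairs):
    for word, one in line_pairs:
        counts[word] = counts.get(word, 0) + one
    return counts

word_counts = reduce(merge, pairs, {})
print(word_counts)  # {'hello': 2, 'world': 1, 'mapreduce': 1}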
What is MapReduce?
Hadoop MapReduce is the data processing layer of Hadoop. It processes the huge amounts of structured and unstructured data stored in HDFS. MapReduce processes data in parallel by dividing a job into a set of independent tasks; this parallel processing improves both speed and reliability.
Hadoop MapReduce data processing takes place in two phases, Map and Reduce (a minimal Python sketch of the flow follows the bullets below).
•Map phase: the first phase of data processing, where we specify all the complex logic, business rules, and costly code.
•Reduce phase: the second phase of processing, where we specify light-weight processing such as aggregation and summation.
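To make the two phases concrete, here is a minimal plain-Python sketch of the word-count flow; the function names and sample records are assumptions for illustration only (in Hadoop the framework itself performs the shuffle and sort between the phases). mrjob and Hadoop Streaming versions appear later in this unit.

from itertools import groupby
from operator import itemgetter

# Map phase: emit a (key, value) pair for every word in every input record.
def map_phase(records):
    for record in records:
        for word in record.split():
            yield (word, 1)

# Shuffle and sort: group the mapper output by key (Hadoop does this for us).
def shuffle_and_sort(pairs):
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

# Reduce phase: light-weight aggregation, here a simple sum per key.
def reduce_phase(grouped):
    for word, word_pairs in grouped:
        yield (word, sum(count for _, count in word_pairs))

records = ["hello world", "hello mapreduce"]  # toy input, purely illustrative
print(dict(reduce_phase(shuffle_and_sort(map_phase(records)))))
# {'hello': 2, 'mapreduce': 1, 'world': 1}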
MapReduce programming offers several benefits to help you gain
valuable insights from your big data: