Map Reduce
MapReduce is a programming model and processing technique used for handling and processing
large datasets in a distributed computing environment. It was developed by Google and later
adopted and popularized by the Apache Hadoop project. This model allows for the processing of
vast amounts of data across a distributed cluster of servers. Below, we look at how MapReduce
processes large-scale data and walk through an example that illustrates its functionality.
The MapReduce model is composed of two primary functions: Map and Reduce. These functions
are inspired by functional programming paradigms and are used to process data in parallel across
a distributed cluster.
1. Map Function
The Map function takes an input key-value pair and processes it to generate a set of intermediate
key-value pairs. The signature of the Map function typically looks like this:
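In the generic notation of the original MapReduce paper, where (k1, v1) is an input key-value pair and (k2, v2) an intermediate one:

map(k1, v1) → list(k2, v2)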
Each mapper operates independently on the input data, allowing for parallel processing across
multiple nodes.
2. Reduce Function
The Reduce function takes the intermediate key-value pairs produced by the Map function and
merges all values associated with the same key. The signature of the Reduce function is:
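In the same notation, the Reducer receives a key together with the list of all intermediate values emitted for that key:

reduce(k2, list(v2)) → list(v2)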
A MapReduce job proceeds through five phases (a small standalone sketch of these phases follows the list):
1. Input Splitting: The input data is split into fixed-size chunks (typically 64 MB or
128 MB), which are then processed independently by the mappers. This division enables
parallel processing.
2. Mapping: Each split is processed by a Map task, which transforms the input data into
intermediate key-value pairs. This phase is highly parallelizable, as each Map task
operates independently.
3. Shuffling and Sorting: The intermediate key-value pairs generated by the Map tasks are
shuffled and sorted by key. This step ensures that all values associated with a particular
key are grouped together and sent to the same Reducer.
4. Reducing: The Reduce tasks receive the grouped intermediate key-value pairs and
aggregate them to produce the final output. This phase combines the results from the
Mappers, ensuring that each key is processed by a single Reducer.
5. Output: The results of the Reduce tasks are written to the distributed file system (HDFS
in the case of Hadoop).
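The following standalone Java sketch is not Hadoop itself, only a single-process illustration of the five phases (the class MiniMapReduce and every name in it are ours). It uses word counting as the running example, which the next section works through in detail:

import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    public static void main(String[] args) {
        // 1. Input splitting: each list element stands in for one input split.
        List<String> splits = Arrays.asList("Hello world", "Hello Hadoop", "Hello MapReduce");

        // 2. Mapping: turn each split into intermediate (word, 1) pairs.
        List<Map.Entry<String, Integer>> intermediate = splits.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());

        // 3. Shuffling and sorting: group all values belonging to the same key.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // 4. Reducing: aggregate the grouped values for each key.
        // 5. Output: print to stdout; a real framework would write to HDFS.
        grouped.forEach((word, counts) ->
                System.out.println(word + "\t" + counts.stream().mapToInt(Integer::intValue).sum()));
    }
}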
To illustrate how MapReduce works, let's consider a classic example: counting the occurrences
of each word in a large collection of text documents.
1. Input Data
Suppose the input is a small text file containing the following three lines (a minimal example consistent with the intermediate pairs shown below):
Hello world
Hello Hadoop
Hello MapReduce
2. Map Function
The Map function processes each line of the text files and generates intermediate key-value pairs
where the key is a word and the value is 1.
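A Map function for this task, sketched against the Hadoop Java API (the class name TokenizerMapper is our choice; Mapper, Text, IntWritable, LongWritable, and Context are standard Hadoop types), might look roughly like this:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line of text.
// Output key: a word; output value: the count 1.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit (word, 1) for every word in the line
        }
    }
}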
For the above input data, the intermediate key-value pairs produced by the Map function would
be:
("Hello", 1)
("world", 1)
("Hello", 1)
("Hadoop", 1)
("Hello", 1)
("MapReduce", 1)
("Hadoop", [1])
("Hello", [1, 1, 1])
("MapReduce", [1])
("world", [1])
4. Reduce Function
The Reduce function aggregates the values for each key to produce the final word count:
("Hadoop", 1)
("Hello", 3)
("MapReduce", 1)
("world", 1)
MapReduce offers several advantages for large-scale data processing.
1. Scalability
MapReduce is highly scalable, enabling the processing of petabytes of data by distributing the
workload across many nodes in a cluster. Each node processes a subset of the data, making it
possible to handle massive datasets efficiently.
2. Fault Tolerance
The framework is built to tolerate hardware failures: if a node goes down, the tasks it was running are automatically rescheduled on other nodes, and the underlying distributed file system keeps replicated copies of the data, so the job can complete without manual intervention.
3. Parallel Processing
By design, MapReduce allows for parallel processing. The Map tasks run concurrently on
different nodes, significantly speeding up the data processing time. Similarly, the Reduce tasks
operate in parallel, further enhancing performance.
4. Simplicity
The MapReduce model abstracts the complexity of distributed computing from the programmer.
Developers only need to focus on writing the Map and Reduce functions, while the underlying
framework handles data distribution, parallelization, and fault tolerance.
5. Flexibility
MapReduce can be used for a wide range of data processing tasks beyond word counting,
including log analysis, data transformations, and more complex computations like graph
processing and machine learning.