
MapReduce: Processing Large-Scale Data in Hadoop

MapReduce is a programming model and processing technique for handling large datasets in a
distributed computing environment. It was developed at Google and later adopted and
popularized by the Apache Hadoop project. The model makes it possible to process vast amounts
of data across a distributed cluster of servers. Below, we walk through how MapReduce
processes large-scale data and use a worked example to illustrate its functionality.

Key Concepts of MapReduce

The MapReduce model is composed of two primary functions: Map and Reduce. These functions
are inspired by functional programming paradigms and are used to process data in parallel across
a distributed cluster.

1. Map Function

The Map function takes an input key-value pair and processes it to generate a set of intermediate
key-value pairs. The signature of the Map function typically looks like this:

map(key, value) -> list of (key', value')

Each mapper operates independently on the input data, allowing for parallel processing across
multiple nodes.
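
For instance (using a hypothetical access-log format, unrelated to the word-count example later), a mapper that counts requests per URL could emit one (url, 1) pair per log line:

def map(key, value):
    # key: position of the line in the input (unused here),
    # value: one access-log line such as "192.168.0.1 GET /index.html 200"
    # (a hypothetical format, purely for illustration)
    fields = value.split()
    if len(fields) >= 3:
        yield (fields[2], 1)   # emit (url, 1) for each request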

2. Reduce Function

The Reduce function takes the intermediate key-value pairs produced by the Map function and
merges all values associated with the same key. The signature of the Reduce function is:

reduce(key', list of values) -> list of (key', value')

This function aggregates the results, producing the final output.
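
Continuing the hypothetical log example above, the matching reducer simply sums the counts collected for each URL; this sum-per-key pattern covers a large share of aggregation jobs:

def reduce(key, values):
    # key: one URL, values: all the 1s emitted for it by the mappers
    yield (key, sum(values))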

Processing Workflow of MapReduce

1. Input Splitting: The input data is divided into fixed-size splits (typically 64 MB or
   128 MB, matching the HDFS block size), each of which is processed independently by a
   mapper. This division is what enables parallel processing.
2. Mapping: Each split is processed by a Map task, which transforms the input data into
intermediate key-value pairs. This phase is highly parallelizable, as each Map task
operates independently.
3. Shuffling and Sorting: The intermediate key-value pairs generated by the Map tasks are
shuffled and sorted by key. This step ensures that all values associated with a particular
key are grouped together and sent to the same Reducer.
4. Reducing: The Reduce tasks receive the grouped intermediate key-value pairs and
aggregate them to produce the final output. This phase combines the results from the
Mappers, ensuring that each key is processed by a single Reducer.
5. Output: The results of the Reduce tasks are written to the distributed file system (HDFS
   in the case of Hadoop). A minimal in-memory sketch of this entire workflow is given below.
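
The sketch below walks through these five steps in a single Python process. It is only a local analogy that makes the data flow visible, not how Hadoop is implemented; map_fn and reduce_fn stand for user-supplied functions such as those in the word-count example that follows.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # 1. Input splitting: 'records' stands in for the (key, value) pairs
    #    read from the fixed-size input splits on HDFS.
    # 2. Mapping: apply the map function to every record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # 3. Shuffling and sorting: group all values by their intermediate key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # 4. Reducing: apply the reduce function to each (key, [values]) group.
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))

    # 5. Output: Hadoop would write this back to HDFS; here we just return it.
    return output

Calling run_mapreduce with the word-count map and reduce functions defined below reproduces the final counts shown in the example.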

Example: Word Count Using MapReduce

To illustrate how MapReduce works, let's consider a classic example: counting the occurrences
of each word in a large collection of text documents.

1. Input Data

Assume we have a set of text files containing the following lines:

file1.txt: "Hello world"
file2.txt: "Hello Hadoop"
file3.txt: "Hello MapReduce"

2. Map Function

The Map function processes each line of the text files and generates intermediate key-value pairs
where the key is a word and the value is 1.

def map(key, value):
    # key: name or offset of the input line (unused), value: one line of text
    words = value.split()
    for word in words:
        # Emit (word, 1) for every word occurrence.
        yield (word, 1)

For the above input data, the intermediate key-value pairs produced by the Map function would
be:

("Hello", 1)
("world", 1)
("Hello", 1)
("Hadoop", 1)
("Hello", 1)
("MapReduce", 1)

3. Shuffle and Sort

The intermediate key-value pairs are shuffled and sorted by key:

("Hadoop", [1])
("Hello", [1, 1, 1])
("MapReduce", [1])
("world", [1])
4. Reduce Function

The Reduce function aggregates the values for each key to produce the final word count:

def reduce(key, values):
    # key: a word, values: the list of 1s emitted for that word
    total = sum(values)
    yield (key, total)

The final output after the Reduce phase would be:

("Hadoop", 1)
("Hello", 3)
("MapReduce", 1)
("world", 1)

Justification and Benefits of MapReduce

1. Scalability

MapReduce is highly scalable, enabling the processing of petabytes of data by distributing the
workload across many nodes in a cluster. Each node processes a subset of the data, making it
possible to handle massive datasets efficiently.

2. Fault Tolerance

Hadoop's implementation of MapReduce includes robust fault tolerance mechanisms. If a node
fails during processing, the system automatically reruns the task on another node, ensuring
reliability and continuous processing without manual intervention.

3. Parallel Processing

By design, MapReduce allows for parallel processing. The Map tasks run concurrently on
different nodes, significantly speeding up the data processing time. Similarly, the Reduce tasks
operate in parallel, further enhancing performance.
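
Hadoop achieves this by scheduling tasks on many machines at once; as a rough single-machine analogy only (not how Hadoop itself is implemented), independent map tasks can be run concurrently with a process pool:

from multiprocessing import Pool

def map_task(split):
    # Each worker handles one split independently, which is exactly
    # what makes the map phase safe to parallelize.
    name, text = split
    return [(word, 1) for word in text.split()]

if __name__ == "__main__":
    splits = [("file1.txt", "Hello world"),
              ("file2.txt", "Hello Hadoop"),
              ("file3.txt", "Hello MapReduce")]
    with Pool(processes=3) as pool:
        intermediate = [pair for part in pool.map(map_task, splits)
                        for pair in part]
    print(intermediate)   # six (word, 1) pairs, produced by three workers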

4. Simplicity

The MapReduce model abstracts the complexity of distributed computing from the programmer.
Developers only need to focus on writing the Map and Reduce functions, while the underlying
framework handles data distribution, parallelization, and fault tolerance.

5. Flexibility

MapReduce can be used for a wide range of data processing tasks beyond word counting,
including log analysis, data transformations, and more complex computations like graph
processing and machine learning.
