
Map Reduce Design and Execution Framework Part 1

The document discusses MapReduce and Hadoop. It provides an overview of MapReduce including how it works, key concepts, and examples like word count. It also discusses implementations of MapReduce like Hadoop.

Uploaded by

l200908

HADOOP AND MAP REDUCE
Map Reduce
■ Idea:
– Bring computation close to the data
– Provide unified programming model to simplify parallelism
– Store data redundantly for reliability

Builds on Distributed File Systems


Distributed File System HDFS

■ Reliable distributed file system


■ Data kept in blocks spread across machines
■ Each block replicated on different machines
– Seamless recovery from disk or machine failure
Machine 1: C0 C1 C5 C2
Machine 2: D0 C1 C5 C3
Machine 3: C2 C5 D0 D1
…
Machine N: C0 C5 D0 C2
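The block layout above can be mimicked with a small sketch. It assumes a simple round-robin placement and hypothetical machine names (`m1` to `m4`); it is not how HDFS actually chooses replica locations, only an illustration of why replicating each block on different machines lets the data survive a single machine failure.

```python
from typing import Dict, List, Set


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, machines: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct machines (round-robin assumption)."""
    if replication > len(machines):
        raise ValueError("need at least as many machines as replicas")
    return {
        block_id: [machines[(block_id + r) % len(machines)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_blocks(placement: Dict[int, List[str]], failed: str) -> Set[int]:
    """Blocks that still have at least one replica after `failed` goes down."""
    return {b for b, hosts in placement.items()
            if any(h != failed for h in hosts)}


blocks = split_into_blocks(b"x" * 1000, block_size=128)
placement = place_replicas(len(blocks), ["m1", "m2", "m3", "m4"], replication=3)
print(placement[0])
print(surviving_blocks(placement, "m1") == set(placement))
```

With three replicas per block, losing any one machine still leaves at least two copies of every block.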

Bring computation directly to the data!

HDFS DETAILS LATER


MapReduce: Overview
■ Sequentially read a lot of data
■ Map:
– Extract something you care about
■ Group by key: Sort and Shuffle
■ Reduce:
– Aggregate, summarize, filter or transform
■ Write the result

Outline stays the same; Map and Reduce change to fit the problem
MAP REDUCE – KEY IDEA

■ Key idea: Programmers specify two functions:


– map (k, v) → <k’, v’>*
– reduce (k’, <v’>*) → <k’, v’’>*
– All values with the same key are sent to the same reducer

The execution framework handles everything else…

(Dean and Ghemawat, OSDI 2004)
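To make the division of labour concrete, here is a minimal single-process sketch of what "the execution framework" does around the two user functions. The name `run_mapreduce` and the even/odd example are assumptions made for illustration; a real framework also distributes the work, handles failures, and moves data between machines.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable, List, Tuple

Pair = Tuple[Any, Any]


def run_mapreduce(records: Iterable[Pair],
                  map_fn: Callable[[Any, Any], Iterable[Pair]],
                  reduce_fn: Callable[[Any, List[Any]], Iterable[Pair]]) -> List[Pair]:
    """Run map, group by key, then reduce, all in one process."""
    # Map phase: every input record may emit any number of (k', v') pairs.
    intermediate: List[Pair] = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Group by key: all values with the same key end up in one list.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce phase: each key is handed to reduce_fn together with all its values.
    output: List[Pair] = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output


# Hypothetical job: sum the even and the odd numbers separately.
def parity_map(_, numbers):
    for n in numbers:
        yield ("even" if n % 2 == 0 else "odd", n)


def sum_reduce(key, values):
    yield (key, sum(values))


result = run_mapreduce([("a", [1, 2, 3]), ("b", [4, 5])], parity_map, sum_reduce)
print(result)
```

Only `parity_map` and `sum_reduce` are specific to the problem; `run_mapreduce` stays the same for any job.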


MapReduce - Word Count
Warm-up task:
■ We have a huge text document

■ Count the number of times each distinct word appears in the file

■ Sample application:
– Analyze web server logs to find popular URLs
MapReduce Example - WordCount

Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png


Word Count Using MapReduce

map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

Word Count Using MapReduce
The same job written in Python with mrjob:

from mrjob.job import MRJob

class WordCount(MRJob):

    def mapper(self, _, line):
        # input key is ignored; line: one line of the input text
        for word in line.split():
            yield (word, 1)

    def reducer(self, word, counts):
        # word: a word; counts: an iterator over its counts
        yield (word, sum(counts))

if __name__ == '__main__':
    WordCount.run()

https://mrjob.readthedocs.io/en/latest/
MapReduce “word count” example

Big document:
Waterloo is a city in Ontario, Canada. It is the smallest of three cities in the Regional Municipality of Waterloo (and previously in Waterloo County, Ontario), and is adjacent to the city of Kitchener.

Map:
(waterloo, 1) (is, 1) (a, 1) … (smallest, 1) (of, 1) (three, 1) … (municipality, 1) (of, 1) (waterloo, 1) … (waterloo, 1) (county, 1) (ontario, 1) …

Group by key:
(waterloo, [1,1,1]) (is, [1]) (smallest, [1]) (of, [1,1]) (municipality, [1]) (county, [1]) (a, [1]) (three, [1]) (ontario, [1]) …

Reduce:
(is, 1) (smallest, 1) (of, 2) … (municipality, 1) (county, 1) (a, 1) … (waterloo, 3) (three, 1) (ontario, 1) …
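The three stages of this example can be replayed on the sample sentence with a short sketch. The tokenization (lowercasing and keeping only letter runs) is an assumption. The slide elides some pairs with "…", so totals computed over the whole sentence can be larger than the ones shown for some words.

```python
import re
from collections import defaultdict

TEXT = ("Waterloo is a city in Ontario, Canada. It is the smallest of three "
        "cities in the Regional Municipality of Waterloo (and previously in "
        "Waterloo County, Ontario), and is adjacent to the city of Kitchener.")


def map_words(text):
    """Map: emit (word, 1) for every word (assumed tokenization: letter runs)."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield (word, 1)


def group_by_key(pairs):
    """Group by key: collect all values emitted for the same word."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_counts(groups):
    """Reduce: sum the list of ones for each word."""
    return {word: sum(ones) for word, ones in groups.items()}


mapped = list(map_words(TEXT))
grouped = group_by_key(mapped)
counts = reduce_counts(grouped)

print(mapped[:3])
print(grouped["waterloo"])
print(counts["waterloo"])
```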

Example: Inverted Index
■ This was Google's original use case
■ Generate an inverted index of words from a given set of files

• Map:
▫ parses a document and emits <word, docId> pairs
• Reduce:
▫ takes all pairs for a given word, sorts the docId values, and emits a <word, list(docId)> pair
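A minimal in-memory sketch of this Map and Reduce pair follows. The three tiny documents and the function names are hypothetical, and emitting each word only once per document is a choice made here so that the posting lists contain no repeated docIds.

```python
import re
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: parse a document and emit <word, docId> pairs (once per distinct word)."""
    for word in set(re.findall(r"[a-z0-9]+", text.lower())):
        yield (word, doc_id)


def reduce_postings(word, doc_ids):
    """Reduce: sort the docIds for one word and emit <word, list(docId)>."""
    return (word, sorted(doc_ids))


def build_index(docs):
    """Stand-in for the framework: run map, group by word, then reduce."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in map_doc(doc_id, text):
            groups[word].append(d)
    return dict(reduce_postings(w, ids) for w, ids in groups.items())


# Hypothetical input documents, keyed by docId.
docs = {
    1: "map reduce on hadoop",
    2: "hadoop stores data in blocks",
    3: "reduce the data",
}
index = build_index(docs)
print(index["hadoop"])
print(index["data"])
```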
Example: Language modeling

■ Statistical machine translation:


– Need to count the number of times every 5-word sequence occurs in a large corpus of documents

Map: • extract (5-word sequence, count) from document
Reduce: • combine counts
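The 5-word sequence count can be sketched as below. The two sample sentences are made up, and splitting on whitespace is an assumed tokenization; a real language-modeling pipeline would run the same Map over a large corpus on many machines.

```python
from collections import Counter


def map_ngrams(text, n=5):
    """Map: emit (n-word sequence, 1) for every window of n consecutive words."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield (tuple(words[i:i + n]), 1)


def reduce_counts(pairs):
    """Reduce: combine the counts emitted for each sequence."""
    totals = Counter()
    for ngram, count in pairs:
        totals[ngram] += count
    return totals


# Hypothetical two-document corpus.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps over the fence",
]
pairs = (pair for doc in docs for pair in map_ngrams(doc))
counts = reduce_counts(pairs)
print(counts[("quick", "brown", "fox", "jumps", "over")])
```

Sequences shared by both documents get a combined count, while sequences seen once keep a count of 1.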


Example: Distributed Grep

■ Find all occurrences of the given pattern in a very large set of files.

Map:
• Apply grep on assigned documents
• Emit list of documents that contain term

Reduce:
• Merge lists
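As a rough illustration, the sketch below lets two mappers each search their own assigned documents and a single reducer merge the resulting lists. The file names, log lines, and the `ERROR` pattern are hypothetical, and Python's `re` module stands in for grep.

```python
import re


def map_grep(assigned_docs, pattern):
    """Map: search the assigned documents and emit the names of those that match."""
    regex = re.compile(pattern)
    return sorted(name for name, text in assigned_docs.items()
                  if regex.search(text))


def reduce_merge(lists):
    """Reduce: merge the per-mapper lists into one sorted list."""
    return sorted(set().union(*lists))


# Hypothetical input, already split between two map tasks.
splits = [
    {"a.log": "INFO start", "b.log": "ERROR disk full"},
    {"c.log": "ERROR timeout", "d.log": "INFO done"},
]
per_mapper = [map_grep(split, r"ERROR") for split in splits]
matches = reduce_merge(per_mapper)
print(per_mapper)
print(matches)
```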


MapReduce Implementations
■ Google has a proprietary implementation in C++
– Bindings in Java, Python
■ Hadoop is an open-source implementation in Java
– Development led by Yahoo, used in production
– Now an Apache project
– Rapidly expanding software ecosystem
■ Lots of custom research implementations
– For GPUs, Cell processors, etc.
Map-Reduce

■ Input & output are stored on DFS
■ Output is often an input to another Map Reduce task
■ Scheduler tries to schedule map tasks close to the physical storage location of input data
■ Intermediate results are stored on the local FS of Map & Reduce tasks to avoid network traffic
