Map Reduce Design and Execution Framework Part 1
MAP REDUCE
■ Idea:
– Bring computation close to the data
– Provide unified programming model to simplify parallelism
– Store data redundantly for reliability
[Figure: data chunks (C2, C3, C5, …, D0, D1) stored redundantly across Machine 1 through Machine N]
MAP REDUCE – KEY IDEA
■ Count the number of times each distinct word appears in the file
■ Sample application:
– Analyze web server logs to find popular URLs
MapReduce Example - WordCount
map(key, value):
// key: document name; value: text of the document
for each word w in value:
emit(w, 1)
reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
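The pseudocode above can be simulated end to end in plain Python. The sketch below is an illustration, not a real MapReduce runtime: the helper names (`map_fn`, `reduce_fn`, `run_mapreduce`) are hypothetical, and the "shuffle" phase that groups intermediate pairs by key is done here with an in-memory dictionary.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # emit (word, 1) for every word in the document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # sum the partial counts for one word
    yield (word, sum(counts))

def run_mapreduce(documents, map_fn, reduce_fn):
    # Shuffle phase: group all intermediate values by key
    groups = defaultdict(list)
    for doc_name, text in documents.items():
        for key, value in map_fn(doc_name, text):
            groups[key].append(value)
    # Reduce phase: one reduce call per distinct key
    results = {}
    for key, values in groups.items():
        for out_key, out_value in reduce_fn(key, values):
            results[out_key] = out_value
    return results

docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real framework the shuffle is distributed: each intermediate (key, value) pair is routed over the network to the reducer responsible for that key.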
Word Count Using MapReduce
from mrjob.job import MRJob
class WordCount(MRJob):

    def mapper(self, _, line):
        # key is ignored; line is one line of the input text
        for word in line.split():
            yield (word, 1)

    def reducer(self, word, counts):
        # counts is an iterator over the partial counts for word
        yield (word, sum(counts))

if __name__ == '__main__':
    WordCount.run()
https://mrjob.readthedocs.io/en/latest/
MapReduce “inverted index” example
• Map:
▫ parses a document and emits
<word, docId> pairs
• Reduce:
▫ takes all pairs for a given word,
sorts the docId values, and emits
a <word,list(docId)> pair
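The inverted-index map and reduce steps above can be sketched in plain Python. This is a minimal in-memory illustration, assuming hypothetical helper names (`map_fn`, `reduce_fn`); the grouping dictionary stands in for the framework's shuffle phase.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # emit <word, doc_id> for each word in the document
    for word in text.split():
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    # sort (and deduplicate) the document ids for this word
    yield (word, sorted(set(doc_ids)))

docs = {"d1": "map reduce", "d2": "reduce example"}

# Shuffle: group emitted doc ids by word
groups = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in map_fn(doc_id, text):
        groups[word].append(d)

# Reduce: one <word, list(docId)> pair per distinct word
inverted = dict(pair for word, ids in groups.items()
                for pair in reduce_fn(word, ids))
print(inverted)
# {'map': ['d1'], 'reduce': ['d1', 'd2'], 'example': ['d2']}
```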
Example: Distributed grep
■ Find all occurrences of a given pattern in a very large set of files.
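This pattern-search task (often called distributed grep) maps naturally onto the framework: the map function emits a line whenever it matches the pattern, and the reduce function simply passes the matches through. The sketch below is an illustration with hypothetical names (`PATTERN`, `map_fn`, and the sample log data are all made up).

```python
import re

PATTERN = re.compile(r"error")  # hypothetical pattern to search for

def map_fn(file_name, line):
    # emit <file_name, line> for every line that matches the pattern
    if PATTERN.search(line):
        yield (file_name, line)

files = {
    "log1": ["ok", "error: disk full", "ok"],
    "log2": ["error: timeout"],
}

# Shuffle: group matching lines by file; reduce is the identity here
matches = {}
for name, lines in files.items():
    for line in lines:
        for key, match in map_fn(name, line):
            matches.setdefault(key, []).append(match)
print(matches)
# {'log1': ['error: disk full'], 'log2': ['error: timeout']}
```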
■ Input and output are stored on the distributed file system (DFS).
■ The output is often an input to another MapReduce task.
■ The scheduler tries to schedule each map task close to the physical storage location of its input data.
■ Intermediate results are stored on the local file system of the Map and Reduce workers to avoid network traffic.