
MapReduce and MRJob

Verena Kaynig-Fittkau
[email protected]
Today's Agenda
Some Python basics
Classes, generators
Introduction to MRJob
MapReduce design patterns

Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer. Morgan & Claypool Publishers, 2010.
MapReduce
Programming model for distributed computations
Software framework for clusters
Massive data processing
No hassle with low-level programming:
Partitioning input data
Scheduling execution
Handling failures
Inter-machine communication

Open source implementation


MRJob: Python class for Hadoop Streaming
[Diagram: MapReduce dataflow. Mappers consume input pairs (k1, v1) … (k6, v6) and emit intermediate pairs such as (a, 1), (b, 2), (c, 3), (c, 6), (a, 5), (c, 2), (b, 7), (c, 8). Shuffle and sort aggregates values by key: a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]. Reducers then produce the final output pairs (r1, s1), (r2, s2), (r3, s3).]
The Famous Word Count Example
Python Classes
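The class shown on this slide is preserved only as an image, so here is a minimal, generic sketch of a Python class; the class and attribute names are made up for illustration:

class Greeter(object):
    """A minimal Python class: state lives on the instance, behavior in methods."""

    def __init__(self, name):
        # the constructor stores per-instance state
        self.name = name

    def greet(self):
        # an ordinary instance method using that state
        return "Hello, %s!" % self.name


g = Greeter("MapReduce")
print(g.greet())   # Hello, MapReduce!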
Derived Class in Python
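Again the slide's code is not in this extract; a minimal sketch of deriving from the base class above (names are illustrative):

class LoudGreeter(Greeter):
    # inherits __init__ and greet from Greeter and adds one new method
    def shout(self):
        return self.greet().upper()


lg = LoudGreeter("Hadoop")
print(lg.greet())   # inherited behavior: Hello, Hadoop!
print(lg.shout())   # new behavior: HELLO, HADOOP!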
Overriding a Method
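A sketch of overriding an inherited method, which is also how an MRJob subclass supplies its own mapper and reducer later on (names again illustrative):

class PoliteGreeter(Greeter):
    # overriding: same method name as in the base class, new behavior
    def greet(self):
        return "Good day to you, %s." % self.name


pg = PoliteGreeter("MRJob")
print(pg.greet())   # Good day to you, MRJob.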
The Famous Word Count Example
Python Generators

http://wiki.python.org/moin/Generators
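The generator code from the slides is not preserved here; a minimal sketch in the spirit of the linked wiki page. A function containing yield returns a lazy iterator instead of building all values up front:

def first_n(n):
    # generator: produces 0, 1, ..., n-1 one value at a time
    num = 0
    while num < n:
        yield num
        num += 1


print(sum(first_n(1000000)))   # values are produced lazily, never stored as a list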
Built-in Python Generators

http://wiki.python.org/moin/Generators
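A few lazy iterators that ship with Python itself (Python 3 spelling; the slides, from the Python 2 era, may have used xrange instead of range):

squares = (x * x for x in range(10))    # generator expression: a one-line lazy generator
print(next(squares))                     # 0
print(list(enumerate(["a", "b", "c"])))  # [(0, 'a'), (1, 'b'), (2, 'c')]
print(list(zip([1, 2], ["x", "y"])))     # [(1, 'x'), (2, 'y')]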
The Famous Word Count Example
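The slide shows the classic word count job; since the code itself is not preserved in this extract, here is a sketch along the lines of the standard mrjob example. Run it locally with: python word_count.py input.txt

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # called once per input line; emit (word, 1) for every word
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # counts is a generator over all the 1s emitted for this word
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()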
Example Input File


Launching the Job
Output File

50 words in total
Why Counting Words is Fun

http://books.google.com/ngrams
More Culturomics
[Diagram repeated: MapReduce dataflow with mappers, shuffle and sort, and reducers, as above.]
Importance of Local Aggregation
Ideal scaling characteristics:
Twice the data, twice the running time
Twice the resources, half the running time
Why can't we achieve this?
Synchronization requires communication
Communication kills performance
Thus avoid communication!
Reduce intermediate data via local aggregation
Two possibilities:
Combiners
In-mapper combining
[Diagram: MapReduce dataflow with combiners and partitioners. Each mapper's output is first combined locally, e.g. (c, 3) and (c, 6) from one mapper become (c, 9); a partitioner then assigns each key to a reducer before shuffle and sort aggregates values by key: a → [1, 5], b → [2, 7], c → [2, 9, 8]. Reducers produce the final output pairs as before.]
Combiner
mini-reducers
Takes mapper output before shuffle and sort
Can significantly reduce network traffic
No access to other mappers
Not guaranteed to get all values for a key
Not guaranteed to run at all!
Key and value output must match mapper
Word Count with Combiner
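Building on the word-count sketch above, a combiner can be added as a method with the same signature as the reducer (the slide's own code is an image; this is a sketch):

class MRWordCountWithCombiner(MRWordCount):

    def combiner(self, word, counts):
        # runs on each mapper's local output before shuffle and sort; because it may
        # run zero or more times, its input and output types match the mapper output
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCountWithCombiner.run()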
Combiner Design
Combiners and reducers share the same method signature
Sometimes, reducers can serve as combiners
Often, not
Remember: combiners are optional optimizations
Should not affect algorithm correctness
May be run 0, 1, or multiple times
Example: find average of all integers associated with the
same key
Computing the Mean: Version 1

Why can't we use the reducer as a combiner?


Computing the Mean: Version 2

Why doesn't this work?


Computing the Mean: Version 3

Fixed?
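The three versions are shown as images on the slides. The core issue: the reducer for a mean cannot simply be reused as a combiner, because a mean of means is not the overall mean and combiners may run any number of times. The usual fix (version 3) is to pass (sum, count) pairs through the combiner and divide only in the reducer; a sketch, assuming input lines of the form "key <tab> integer":

from mrjob.job import MRJob


class MRMean(MRJob):

    def mapper(self, _, line):
        key, value = line.split('\t')
        yield key, (int(value), 1)            # (partial sum, partial count)

    def combiner(self, key, pairs):
        # safe: combining (sum, count) pairs is associative, unlike combining means
        s = c = 0
        for partial_sum, partial_count in pairs:
            s += partial_sum
            c += partial_count
        yield key, (s, c)

    def reducer(self, key, pairs):
        s = c = 0
        for partial_sum, partial_count in pairs:
            s += partial_sum
            c += partial_count
        yield key, s / float(c)               # divide only once, at the very end


if __name__ == '__main__':
    MRMean.run()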
In-Mapper Combining
Fold the functionality of the combiner into the mapper by
preserving state across multiple map calls
In-Mapper Combining
Advantages
Speed
Why is this faster than actual combiners?
Disadvantages
Explicit memory management required
Potential for bugs
Word Count with In-Mapper Combining
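A sketch of in-mapper combining with mrjob, using mapper_init() and mapper_final() to keep a dictionary of counts across map calls; this buffered dictionary is exactly the explicit state the "disadvantages" bullet refers to:

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordCountInMapper(MRJob):

    def mapper_init(self):
        # state preserved across all map calls in this task
        self.counts = {}

    def mapper(self, _, line):
        # emit nothing per line; everything is emitted once in mapper_final
        for word in WORD_RE.findall(line):
            w = word.lower()
            self.counts[w] = self.counts.get(w, 0) + 1

    def mapper_final(self):
        # runs once, after the last input line of this map task
        for word, count in self.counts.items():
            yield word, count

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCountInMapper.run()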
Word of Caution
[Diagram repeated: MapReduce dataflow with combiners and partitioners, as above.]
Partitioner
Decides which reducer gets which key-value pair
Default depends on hash value of key in raw byte
representation
May result in unequal load balancing
Custom partitioner often wise for complex keys
Python Code for Partitioner
Hadoop has a selection of partitioners
You can specify which partitioner MRJob should choose:

The key specification is of the form

-k pos1[,pos2]

pos1 is the number of the key field to use
Fields are numbered starting with 1
Appending n specifies numeric (integer) keys

http://hadoop.apache.org/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/KeyFieldBasedPartitioner.html
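A sketch of how this might look in an mrjob job, assuming a version of mrjob that still exposes the partitioner() hook and the JOBCONF class attribute (check the docs for your mrjob version; the key specification below is purely illustrative):

from mrjob.job import MRJob


class MRPartitioned(MRJob):

    # assumption: this mrjob version passes JOBCONF entries through to Hadoop Streaming
    JOBCONF = {
        'mapred.text.key.partitioner.options': '-k1,1',   # partition on the first key field
        'stream.num.map.output.key.fields': '2',
    }

    def partitioner(self):
        # assumption: older mrjob versions let you name a Hadoop partitioner class here
        return 'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner'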
http://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg
What's a graph?
G = (V,E), where
V represents the set of vertices (nodes)
E represents the set of edges (links)
Both vertices and edges may contain additional information
Different types of graphs:
Directed vs. undirected edges
Presence or absence of cycles
Graphs are everywhere:
Hyperlink structure of the Web
Physical structure of computers on the Internet
Interstate highway system
Social networks
Some Graph Problems
Finding shortest paths
Routing Internet traffic and UPS trucks
Finding minimum spanning trees
Telco laying down fiber

Identify special nodes and communities


Breaking up terrorist cells, spread of avian flu
Bipartite matching
Monster.com, Match.com
PageRank
Graphs and MapReduce
Graph algorithms typically involve:
Performing computations at each node: based on node features,
edge features, and local link structure
Propagating computations: traversing the graph
Key questions:
How do you represent graph data in MapReduce?
How do you traverse a graph in MapReduce?
Representing Graphs
G = (V, E)
Two common representations
Adjacency matrix
Adjacency list
Adjacency Matrices
Represent a graph as an n x n square matrix M
n = |V|
Mij = 1 means a link from node i to j

[Figure: example directed graph on nodes 1–4 with its adjacency matrix]

    1 2 3 4
1   0 1 0 1
2   1 0 1 1
3   1 0 0 0
4   1 0 1 0
Adjacency Matrices: Critique
Advantages:
Amenable to mathematical manipulation
Iteration over rows and columns corresponds to computations on
outlinks and inlinks
Disadvantages:
Lots of zeros for sparse matrices
Lots of wasted space
Adjacency Lists
Take adjacency matrices and throw away all the zeros

    1 2 3 4
1   0 1 0 1        1: 2, 4
2   1 0 1 1        2: 1, 3, 4
3   1 0 0 0        3: 1
4   1 0 1 0        4: 1, 3
Adjacency Lists: Critique
Advantages:
Much more compact representation
Easy to compute over outlinks
Disadvantages:
Much more difficult to compute over inlinks
JSON Encoding
JSON: JavaScript Object Notation
lightweight data interchange format

http://docs.python.org/library/json.html
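A small sketch of encoding and decoding a node record with Python's json module; the record layout is just an assumption for illustration:

import json

# hypothetical node record: distance from the source plus an adjacency list
node = {'id': 1, 'distance': 0, 'adj': [2, 4]}

line = json.dumps(node)            # '{"id": 1, "distance": 0, "adj": [2, 4]}'
print(line)
print(json.loads(line)['adj'])     # back to Python objects: [2, 4]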
Example Input

[Figure: example graph used as input; the numbers shown are node ids and edge weights from the figure.]
JSON for MRJob
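The slide's code is not preserved in this extract. By default mrjob already serializes keys and values between steps as JSON; input lines can either be parsed by hand with json.loads or decoded by choosing an input protocol. A hedged sketch of both options:

import json
from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol


class MRGraphJob(MRJob):

    # option 1: let mrjob decode each input line as a JSON value
    INPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, _, node):
        # node arrives as a decoded dict such as {"id": 1, "distance": 0, "adj": [2, 4]}
        yield node['id'], node

    # option 2 (alternative): keep the default raw input and call json.loads(line) yourself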
Single Source Shortest Path
Problem: find shortest path from a source node to one or
more target nodes
Shortest might also mean lowest weight or cost
MapReduce: parallel Breadth-First Search (BFS)
Finding the Shortest Path
Consider simple case of equal edge weights
Solution to the problem can be defined inductively
Here's the intuition:
Define: b is reachable from a if b is on the adjacency list of a
DISTANCETO(s) = 0
For all nodes p reachable from s,
DISTANCETO(p) = 1
For all nodes n reachable from some other set of nodes M,
DISTANCETO(n) = 1 + min(DISTANCETO(m)), m ∈ M
[Figure: node n is reachable from nodes m1, m2, m3, which lie at distances d1, d2, d3 from the source s.]
Visualizing Parallel BFS

[Figure: example graph with nodes n0–n9 illustrating how the BFS frontier expands outward from the source, one hop per iteration.]
From Intuition to Algorithm
Data representation:
Key: node n
Value: d (distance from start), adjacency list (list of nodes
reachable from n)
Initialization: for all nodes except the start node, d = ∞
Mapper:
∀ m ∈ adjacency list: emit (m, d + 1)
Sort/Shuffle
Groups distances by reachable nodes
Reducer:
Selects minimum distance path for each reachable node
Additional bookkeeping needed to keep track of actual path
Multiple Iterations Needed
Each MapReduce iteration advances the known frontier
by one hop
Subsequent iterations include more and more reachable nodes as
frontier expands
Multiple iterations are needed to explore entire graph
Preserving graph structure:
Problem: Where did the adjacency list go?
Solution: mapper emits (n, adjacency list) as well
BFS Pseudo-Code
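The pseudo-code itself is an image on the slide; the sketch below is one possible mrjob rendering of the algorithm just described, assuming unit edge weights and node records that carry a distance and an adjacency list. A real job would also need an external driver loop to rerun this step until no distances change:

from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol


class MRParallelBFS(MRJob):

    # read and write (node_id, node_record) pairs as tab-separated JSON, so that the
    # output of one iteration can be fed back in as the input of the next
    INPUT_PROTOCOL = JSONProtocol
    OUTPUT_PROTOCOL = JSONProtocol

    def mapper(self, node_id, node):
        d = node['distance']            # None plays the role of infinity (not reached yet)
        # always pass the graph structure along, or it is lost after this iteration
        yield node_id, ('graph', node['adj'])
        if d is not None:
            yield node_id, ('distance', d)
            for m in node['adj']:       # everything adjacent is reachable in one more hop
                yield m, ('distance', d + 1)

    def reducer(self, node_id, values):
        adj, best = [], None
        for kind, payload in values:
            if kind == 'graph':
                adj = payload
            elif best is None or payload < best:
                best = payload          # keep the minimum candidate distance
        yield node_id, {'distance': best, 'adj': adj}


if __name__ == '__main__':
    MRParallelBFS.run()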
Weighted Edges
Now add positive weights to the edges
Simple change: adjacency list now includes a weight w for
each edge
In the mapper, emit (m, d + w) instead of (m, d + 1) for each node m, where w is the weight of the edge to m
Stopping Criterion
No updates in an iteration
When a node is first discovered, we've not necessarily found the
shortest path to it
Additional Complexities

[Figure: weighted graph showing the current search frontier; edges of weight 1 and an edge of weight 10 illustrate that the first path found to a node need not be the shortest one, so further iterations are required.]
Graphs and MapReduce
Graph algorithms typically involve:
Performing computations at each node: based on node features,
edge features, and local link structure
Propagating computations: traversing the graph
Generic recipe:
Represent graphs as adjacency lists
Perform local computations in mapper
Pass along partial results via outlinks, keyed by destination node
Perform aggregation in reducer on inlinks to a node
Iterate until convergence: controlled by external driver
Don't forget to pass the graph structure between iterations
Random Walks Over the Web
Random surfer model:
User starts at a random Web page
User randomly clicks on links, surfing from page to page
PageRank
Characterizes the amount of time spent on any given page
Mathematically, a probability distribution over pages
PageRank captures notions of page importance
Correspondence to human intuition?
One of thousands of features used in web search
PageRank: Defined
Given page x with inlinks t1…tn, where
C(t) is the out-degree of t
α is the probability of a random jump
N is the total number of nodes in the graph

PR(x) = α · (1/N) + (1 − α) · Σ_{i=1}^{n} PR(t_i) / C(t_i)

[Figure: page x with inlinks from pages t1, t2, …, tn]
Computing PageRank
Properties of PageRank
Can be computed iteratively
Effects at each iteration are local
Sketch of algorithm:
Start with seed PRi values
Each page distributes PRi credit to all pages it links to
Each target page adds up credit from multiple in-bound links to
compute PRi+1
Iterate until values converge
Simplified PageRank
First, tackle the simple case:
No random jump factor
No dangling links
Sample PageRank Iteration (1)

[Figure: iteration 1 on the five-node example, using simplified PageRank (no random jump). Every node starts with PR = 0.2 and distributes it evenly over its outlinks: n1 and n2 each send 0.1 along their two outlinks, n3 and n4 send 0.2 along their single outlink, n5 sends 0.066 along each of its three outlinks. Summing the incoming mass gives n1 = 0.066, n2 = 0.166, n3 = 0.166, n4 = 0.3, n5 = 0.3.]
Sample PageRank Iteration (2)

[Figure: iteration 2. Repeating the same redistribution with the iteration-1 values gives n1 = 0.1, n2 = 0.133, n3 = 0.183, n4 = 0.2, n5 = 0.383.]
PageRank in MapReduce

[Figure: one PageRank iteration on the five-node example with graph structure n1 → [n2, n4], n2 → [n3, n5], n3 → [n4], n4 → [n5], n5 → [n1, n2, n3]. Map: each node divides its current PageRank mass evenly over its outlinks and emits (destination, mass) pairs, plus its own (node, adjacency list) record to preserve the graph structure. Shuffle and sort groups everything by destination node. Reduce: each node sums the incoming mass to obtain its new PageRank value and re-emits its adjacency list.]
PageRank Pseudo-Code
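As with BFS, the pseudo-code is an image on the slide; the sketch below covers the simplified version (no random-jump factor, no dangling nodes), again with an external driver expected to rerun the step until the values converge:

from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol


class MRSimplePageRank(MRJob):

    INPUT_PROTOCOL = JSONProtocol
    OUTPUT_PROTOCOL = JSONProtocol

    def mapper(self, node_id, node):
        # preserve the graph structure for the next iteration
        yield node_id, ('graph', node['adj'])
        # divide this node's current PageRank mass evenly over its outlinks
        if node['adj']:
            share = node['pagerank'] / len(node['adj'])
            for m in node['adj']:
                yield m, ('mass', share)

    def reducer(self, node_id, values):
        adj, total = [], 0.0
        for kind, payload in values:
            if kind == 'graph':
                adj = payload
            else:
                total += payload        # sum the incoming PageRank mass
        yield node_id, {'pagerank': total, 'adj': adj}


if __name__ == '__main__':
    MRSimplePageRank.run()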
AMAZON ACCOUNT SETUP
Go to aws.amazon.com
Register as a new user
Fill in Account Form
Name
Address

Credit Card Information!


You get a $100 AWS credit code for your course work. If
you use more, your credit card will be charged!

Complete survey to register for credit code (see HW1)


Wait for email reply with code
Redeem Credit
Account Activity
Create Billing Alarm

Up to 10 alarms per month free


How to Configure MRJob for EC2
Set environment variables:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY

Use configuration file


Set environment variable to configuration file path
MRJOB_CONF
MRJOB CONFIG FILE
runners:
  emr:
    # be careful when editing this file
    # spaces vs tabs are important
    aws_access_key_id: MY_KEY_IS_SECURE
    # if you want to run in a different region
    # set it here
    # aws_region: us-west-1
    aws_secret_access_key: SO_IS_MY_PASSWORD
    # see the following link for different instance types.
    # use api names. http://aws.amazon.com/ec2/instance-types/
    ec2_instance_type: m1.small
    num_ec2_instances: 1
    check_emr_status_every: 5
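With a configuration like this in place (and real keys filled in), the same script that was tested locally can be run on EMR by adding -r emr on the command line, e.g. python word_count.py -r emr input.txt (the file names here are placeholders).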
How to Find your Keys
Access Credentials
Running on Amazon
Example Output
Example Output Part 2
Good to Know
Starting a job on Amazon can take a few minutes
This is even the case for very small jobs
Test locally!
But make sure the code also runs in the cloud!

Amazon takes some time to update your billing information.
