14 MapReduce PDF
MRJob
Verena Kaynig-Fittkau
[email protected]
Today's Agenda
Some Python basics
Classes, generators
Introduction to MRJob
MapReduce design patterns
[Diagram: MapReduce data flow — mapper output pairs (a,1) (b,2) (c,3) (c,6) (a,5) (c,2) (b,7) (c,8) are shuffled by key and reduced to (r1,s1) (r2,s2) (r3,s3)]
The Famous Word Count Example
Python Classes
Python Classes
Derived Class in Python
Derived Class in Python
Overriding a Function
Overriding a Function
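The slide code isn't reproduced in the extracted text; a minimal sketch of a derived class overriding a base-class method (class and method names are illustrative, not from the slides):

```python
# Minimal sketch of subclassing and method overriding in Python.
class Mapper:
    def map(self, line):
        # Base behavior: treat the whole line as one key.
        return [(line, 1)]

class WordMapper(Mapper):
    # Overrides Mapper.map: same name and signature, new behavior.
    def map(self, line):
        return [(w, 1) for w in line.split()]

print(Mapper().map("a b"))      # [('a b', 1)]
print(WordMapper().map("a b"))  # [('a', 1), ('b', 1)]
```

This is the same mechanism MRJob relies on: a job class derives from a framework base class and overrides selected methods.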
The Famous Word Count Example
Python generators
http://wiki.python.org/moin/Generators
Python generators
http://wiki.python.org/moin/Generators
Built-in Python Generators
http://wiki.python.org/moin/Generators
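A quick sketch of the generator idea — values are produced lazily with `yield` rather than materialized in a list:

```python
def squares(n):
    # Generator function: yields one value at a time instead of
    # building the whole list in memory.
    for i in range(n):
        yield i * i

g = squares(4)
print(next(g))   # 0
print(list(g))   # [1, 4, 9] -- the remaining values

# Built-in lazy iterables behave the same way:
print(list(enumerate("ab")))  # [(0, 'a'), (1, 'b')]
```

MRJob mappers and reducers are written in exactly this style, yielding key-value pairs one at a time.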
The Famous Word Count Example
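MRJob expresses word count as a job class overriding `mapper` and `reducer`; since the slide code isn't in the extracted text, here is the same dataflow simulated in plain Python (function names are illustrative):

```python
from collections import defaultdict

def mapper(_, line):
    # Emit (word, 1) for every word, as in the MRJob mapper.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Sum all counts grouped under one word.
    yield word, sum(counts)

def run(lines):
    # Simulate MapReduce: map, shuffle/sort (group by key), reduce.
    groups = defaultdict(list)
    for line in lines:
        for k, v in mapper(None, line):
            groups[k].append(v)
    return dict(kv for k in sorted(groups) for kv in reducer(k, groups[k]))

result = run(["the cat", "the dog"])
print(result)  # {'cat': 1, 'dog': 1, 'the': 2}
```

In a real MRJob job these two generators become methods of a class derived from `MRJob`, and Hadoop performs the grouping.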
Example Input File
Launching the Job
Output File
50 words in total
Why Counting Words is Fun
http://books.google.com/ngrams
More Culturomics
Importance of Local Aggregation
Ideal scaling characteristics:
Twice the data, twice the running time
Twice the resources, half the running time
Why can't we achieve this?
Synchronization requires communication
Communication kills performance
Thus avoid communication!
Reduce intermediate data via local aggregation
Two possibilities:
Combiners
In-mapper combining
[Diagram: with a combiner, one mapper's pairs (c,3) and (c,6) are pre-aggregated to (c,9) before the shuffle, so fewer pairs reach the reducers]
Combiner
mini-reducers
Takes mapper output before shuffle and sort
Can significantly reduce network traffic
No access to other mappers
Not guaranteed to get all values for a key
Not guaranteed to run at all!
Key and value output must match mapper
Word Count with Combiner
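The combiner runs on a single mapper's output before the shuffle; a plain-Python sketch of that local aggregation step (names are illustrative):

```python
from collections import defaultdict

def combine(mapper_output):
    # Combiner: sums counts locally, before anything crosses the network.
    local = defaultdict(int)
    for word, n in mapper_output:
        local[word] += n
    return list(local.items())

# One mapper's output for "c c a": three pairs shrink to two.
combined = combine([("c", 1), ("c", 1), ("a", 1)])
print(sorted(combined))  # [('a', 1), ('c', 2)]
```

Note that the output type matches the input type (word, count) — the requirement that combiner output must look like mapper output.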
Combiner Design
Combiners and reducers share the same method signature
Sometimes, reducers can serve as combiners
Often, not
Remember: combiners are optional optimizations
Should not affect algorithm correctness
May be run 0, 1, or multiple times
Example: find the average of all integers associated with the same key
Computing the Mean: Version 1
Fixed?
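The fix can be checked with a small worked example: a combiner that emits local means is wrong, because a mean of means is not the overall mean, while a combiner that emits (sum, count) pairs is safe:

```python
# Two mapper partitions of the same key's values.
part1, part2 = [1, 2, 3], [4]

# Version 1 (broken): combiner emits the local mean; the reducer
# then averages the means -- (2.0 + 4.0) / 2 = 3.0, which is wrong.
broken = (sum(part1) / len(part1) + sum(part2) / len(part2)) / 2

# Fixed: combiner emits (sum, count); the reducer divides exactly once.
s = sum(part1) + sum(part2)   # 10
n = len(part1) + len(part2)   # 4
fixed = s / n                 # 2.5, the true mean

print(broken, fixed)  # 3.0 2.5
```

Because summing (sum, count) pairs is associative and commutative, the fixed combiner can run zero, one, or many times without changing the result.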
In-Mapper Combining
Fold the functionality of the combiner into the mapper by
preserving state across multiple map calls
In-Mapper Combining
Advantages
Speed
Why is this faster than actual combiners?
Disadvantages
Explicit memory management required
Potential for bugs
Word Count with In-Mapper Combining
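A plain-Python sketch of the in-mapper combining pattern, with the per-mapper dictionary as the explicitly managed state (class and method names are illustrative):

```python
class WordCountMapper:
    # In-mapper combining: state is preserved across map() calls.
    def __init__(self):
        self.counts = {}  # the explicitly managed in-memory state

    def map(self, line):
        # Aggregate locally instead of emitting (word, 1) per word.
        for word in line.split():
            self.counts[word] = self.counts.get(word, 0) + 1

    def close(self):
        # Emit the aggregated pairs once, after all input is seen.
        return sorted(self.counts.items())

m = WordCountMapper()
for line in ["a b a", "b a"]:
    m.map(line)
print(m.close())  # [('a', 3), ('b', 2)]
```

This is faster than an actual combiner because nothing is serialized and re-read between map and combine, but the dictionary must fit in memory — the "word of caution" from the slides.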
Word of Caution
Partitioner
Decides which reducer gets which key-value pair
The default partitioner hashes the key's raw byte representation
May result in unequal load balancing
Custom partitioner often wise for complex keys
Python Code for Partitioner
Hadoop has a selection of partitioners
You can specify which partitioner MRJob should choose:
http://hadoop.apache.org/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/KeyFieldBasedPartitioner.html
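A sketch of what the default behavior amounts to — hash the key's bytes, then take the result modulo the number of reducers (crc32 stands in here for Hadoop's key hash; this is an illustration, not Hadoop's actual function):

```python
import zlib

def partition(key, num_reducers):
    # Stable hash of the key's byte representation, mapped onto
    # one of the available reducers.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["apple", "banana", "cherry"]
print({k: partition(k, 2) for k in keys})
```

Because the assignment depends only on the hash, skewed key distributions translate directly into skewed reducer load — the motivation for custom partitioners on complex keys.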
http://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg
What's a graph?
G = (V,E), where
V represents the set of vertices (nodes)
E represents the set of edges (links)
Both vertices and edges may contain additional information
Different types of graphs:
Directed vs. undirected edges
Presence or absence of cycles
Graphs are everywhere:
Hyperlink structure of the Web
Physical structure of computers on the Internet
Interstate highway system
Social networks
Some Graph Problems
Finding shortest paths
Routing Internet traffic and UPS trucks
Finding minimum spanning trees
Telco laying down fiber
Adjacency matrix of the four-node example graph:
  1 2 3 4
1 0 1 0 1
2 1 0 1 1
3 1 0 0 0
4 1 0 1 0
Adjacency Matrices: Critique
Advantages:
Amenable to mathematical manipulation
Iteration over rows and columns corresponds to computations on
outlinks and inlinks
Disadvantages:
Lots of zeros for sparse matrices
Lots of wasted space
Adjacency Lists
Take adjacency matrices and throw away all the zeros
  1 2 3 4
1 0 1 0 1     1: 2, 4
2 1 0 1 1     2: 1, 3, 4
3 1 0 0 0     3: 1
4 1 0 1 0     4: 1, 3
Adjacency Lists: Critique
Advantages:
Much more compact representation
Easy to compute over outlinks
Disadvantages:
Much more difficult to compute over inlinks
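The conversion from matrix to lists can be sketched in a few lines — keep, for each row, only the columns holding a 1 (the dict layout is illustrative):

```python
# The slide's adjacency matrix, one row per node.
matrix = {1: [0, 1, 0, 1],
          2: [1, 0, 1, 1],
          3: [1, 0, 0, 0],
          4: [1, 0, 1, 0]}

# Throw away the zeros: record only the columns that contain a 1.
adj = {node: [j + 1 for j, bit in enumerate(row) if bit]
       for node, row in matrix.items()}
print(adj)  # {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}
```

The lists make iterating over outlinks trivial; recovering inlinks would require scanning every list, which is the disadvantage noted above.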
JSON Encoding
JSON: JavaScript Object Notation
lightweight data interchange format
http://docs.python.org/library/json.html
Example Input
[Slide shows a small example input file, one graph node per line]
JSON for MRJob
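A minimal sketch of the round trip MRJob's JSON protocols perform on each value, assuming an illustrative node record (the field names are not from the slides):

```python
import json

# A node record: distance plus adjacency list, one line per node.
node = {"id": 1, "distance": 0, "adjacency": [2, 4]}

line = json.dumps(node)       # serialize for the output stream
decoded = json.loads(line)    # parse it back in the next step's mapper
print(decoded == node)        # True
```

Because JSON handles nested lists and dicts, the whole graph structure can ride along with each key-value pair between iterations.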
Single Source Shortest Path
Problem: find shortest path from a source node to one or
more target nodes
"Shortest" might also mean lowest weight or cost
MapReduce: parallel Breadth-First Search (BFS)
Finding the Shortest Path
Consider simple case of equal edge weights
Solution to the problem can be defined inductively
Heres the intuition:
Define: b is reachable from a if b is on adjacency list of a
DISTANCETO(s) = 0
For all nodes p reachable from s,
DISTANCETO(p) = 1
For all nodes n reachable from some other set of nodes M,
DISTANCETO(n) = 1 + min{DISTANCETO(m) : m ∈ M}
[Diagram: source s; node n is reached from nodes m1, m2, m3 at distances d1, d2, d3]
Visualizing Parallel BFS
[Diagram: BFS frontier expanding outward from n0 over nodes n1–n9, one hop per iteration]
From Intuition to Algorithm
Data representation:
Key: node n
Value: d (distance from start), adjacency list (list of nodes
reachable from n)
Initialization: for all nodes except the start node, d = ∞
Mapper:
∀ m ∈ adjacency list: emit (m, d + 1)
Sort/Shuffle
Groups distances by reachable nodes
Reducer:
Selects minimum distance path for each reachable node
Additional bookkeeping needed to keep track of actual path
Multiple Iterations Needed
Each MapReduce iteration advances the known frontier
by one hop
Subsequent iterations include more and more reachable nodes as
frontier expands
Multiple iterations are needed to explore entire graph
Preserving graph structure:
Problem: Where did the adjacency list go?
Solution: mapper emits (n, adjacency list) as well
BFS Pseudo-Code
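The pseudo-code on the slide isn't in the extracted text; here is a plain-Python sketch of a single BFS iteration, using ("graph", …)/("dist", …) tags as an illustrative bookkeeping convention to keep adjacency lists and distances apart:

```python
from collections import defaultdict

INF = float("inf")

def mapper(node, value):
    d, adj = value
    yield node, ("graph", adj)          # preserve the graph structure
    yield node, ("dist", d)             # keep the node's current distance
    if d < INF:
        for m in adj:
            yield m, ("dist", d + 1)    # tentative distance to neighbor

def reducer(node, values):
    adj, best = [], INF
    for tag, v in values:
        if tag == "graph":
            adj = v
        else:
            best = min(best, v)         # select the minimum distance
    return node, (best, adj)

# Value per node: (distance from source, adjacency list); node 1 is the source.
graph = {1: (0, [2, 4]), 2: (INF, [3]), 3: (INF, []), 4: (INF, [3])}

groups = defaultdict(list)              # simulate the shuffle/sort
for n, v in graph.items():
    for k, val in mapper(n, v):
        groups[k].append(val)
result = dict(reducer(n, vs) for n, vs in groups.items())
print(result)
```

After this iteration nodes 2 and 4 are at distance 1; node 3 stays at ∞ until the next iteration — the frontier advances one hop per MapReduce pass.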
Weighted Edges
Now add positive weights to the edges
Simple change: adjacency list now includes a weight w for
each edge
In the mapper, emit (m, d + w) instead of (m, d + 1) for each node m, where w is the weight of the edge to m
Stopping Criterion
No updates in an iteration
With weighted edges, when a node is first discovered we've not necessarily found the shortest path to it
Additional Complexities
[Diagram: search frontier after the first iteration; a direct edge from s to r with weight 10 discovers r early, but the shorter path through p and q over weight-1 edges is only found in later iterations]
Graphs and MapReduce
Graph algorithms typically involve:
Performing computations at each node: based on node features,
edge features, and local link structure
Propagating computations: traversing the graph
Generic recipe:
Represent graphs as adjacency lists
Perform local computations in mapper
Pass along partial results via outlinks, keyed by destination node
Perform aggregation in reducer on inlinks to a node
Iterate until convergence: controlled by external driver
Don't forget to pass the graph structure between iterations
Random Walks Over the Web
Random surfer model:
User starts at a random Web page
User randomly clicks on links, surfing from page to page
PageRank
Characterizes the amount of time spent on any given page
Mathematically, a probability distribution over pages
PageRank captures notions of page importance
Correspondence to human intuition?
One of thousands of features used in web search
PageRank: Defined
Given page x with inlinks t1…tn, where
C(t) is the out-degree of t
α is the probability of a random jump
N is the total number of nodes in the graph

PR(x) = α (1/N) + (1 − α) Σ_{i=1..n} PR(t_i) / C(t_i)
Computing PageRank
Properties of PageRank
Can be computed iteratively
Effects at each iteration are local
Sketch of algorithm:
Start with seed PRi values
Each page distributes PRi credit to all pages it links to
Each target page adds up credit from multiple in-bound links to
compute PRi+1
Iterate until values converge
Simplified PageRank
First, tackle the simple case:
No random jump factor
No dangling links
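This simple case can be checked in a few lines of plain Python on the slides' five-node example (the edge structure below is inferred from the iteration values shown on the following slides):

```python
# Simplified PageRank: no random jump, no dangling nodes.
# Each page splits its current rank evenly over its outlinks.
graph = {"n1": ["n2", "n4"], "n2": ["n3", "n5"], "n3": ["n4"],
         "n4": ["n5"], "n5": ["n1", "n2", "n3"]}
pr = {n: 0.2 for n in graph}            # seed: uniform over 5 pages

def iterate(pr):
    nxt = {n: 0.0 for n in pr}
    for n, links in graph.items():
        share = pr[n] / len(links)      # C(n) = out-degree of n
        for m in links:
            nxt[m] += share             # target adds up in-bound credit
    return nxt

pr = iterate(pr)
print({n: round(v, 3) for n, v in sorted(pr.items())})
```

One iteration reproduces the values on the next slide: n1 ≈ 0.066, n3 ≈ 0.166, n4 = 0.3, n5 = 0.3; note that total mass stays 1.0, a useful sanity check.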
Sample PageRank Iteration (1)
[Diagram: starting from PR = 0.2 on all five nodes, one iteration (no random jump) yields n1 = 0.066, n3 = 0.166, n4 = 0.3, n5 = 0.3]
Sample PageRank Iteration (2)
[Diagram: a second iteration yields n3 = 0.183, n4 = 0.2, n5 = 0.383]
PageRank in MapReduce
[Diagram: in the map phase each page emits its PageRank mass along its outlinks; the shuffle groups the mass by destination node n1–n5; the reduce phase sums the in-bound credit per node]