Module 4 MapReduce and Link Analysis
ANISHA JOSE PH
Web as a Graph
PageRank
A page gets a high score not only because it has many in-links, but also because those in-links come from pages that are themselves important.
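Eigenvector Formulation
•The PageRank vector v satisfies v = Mv, so v is an eigenvector of the transition matrix M. Here the eigenvalue λ is 1.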
•A square matrix A is stochastic if all of its entries are nonnegative,
and the entries of each column sum to 1.
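A minimal sketch of the eigenvector view: repeatedly multiplying a starting vector by a column-stochastic matrix M approaches a vector v with v = Mv. The 4-node matrix below is an assumed illustration, not taken from the slides, and the helper names are hypothetical.

```python
def is_stochastic(M, tol=1e-12):
    """True if all entries are nonnegative and every column sums to 1."""
    n = len(M)
    nonnegative = all(M[i][j] >= 0 for i in range(n) for j in range(n))
    columns_ok = all(abs(sum(M[i][j] for i in range(n)) - 1) < tol
                     for j in range(n))
    return nonnegative and columns_ok

def power_iterate(M, iterations):
    """Start from the uniform vector and apply v <- Mv repeatedly."""
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v

# Assumed example graph: A -> B, C, D; B -> A, D; C -> A; D -> B, C.
# Column j holds the out-link probabilities of node j.
M = [
    [0,   1/2, 1, 0],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   0, 1/2],
    [1/3, 1/2, 0, 0],
]
v = power_iterate(M, 100)
```

For this particular matrix the iteration settles near (3/9, 2/9, 2/9, 2/9), which can be checked by hand against v = Mv.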
Suppose we have an n × n matrix M and a vector v of length n. Assume that
n is large, but that v can fit in main memory and thus be available to every
Map task.
The matrix M and the vector v each will be stored in a file of the DFS.
We assume that the row-column coordinates of each matrix element will be
discoverable, either from its position in the file, or because it is stored with
explicit coordinates, as a triple (i, j, m_ij).
Matrix-Vector Multiplication by MapReduce
The Map Function:
Each Map task will operate on a chunk of the matrix M. From each
matrix element m_ij it produces the key-value pair (i, m_ij v_j). Thus, all
terms of the sum x_i = Σ_j m_ij v_j that make up the component x_i of the
matrix-vector product will get the same key, i.
The Reduce Function: The Reduce function simply sums all the
values associated with a given key i. The result will be a pair (i, x_i).
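The Map and Reduce functions above can be sketched in plain Python. This is a single-process simulation under assumptions, not a real MapReduce job: the function names are hypothetical, the matrix is given as chunks of (i, j, m_ij) triples, and the grouping step stands in for the framework's shuffle.

```python
from collections import defaultdict

def map_task(chunk, v):
    """For each triple (i, j, m_ij) in the chunk, emit (i, m_ij * v_j)."""
    for i, j, m_ij in chunk:
        yield i, m_ij * v[j]

def reduce_task(i, values):
    """Sum all values that share key i, giving the pair (i, x_i)."""
    return i, sum(values)

def matrix_vector_mapreduce(chunks, v):
    groups = defaultdict(list)          # stands in for the shuffle phase
    for chunk in chunks:
        for key, value in map_task(chunk, v):
            groups[key].append(value)
    return dict(reduce_task(i, vals) for i, vals in groups.items())

# M = [[1, 2], [3, 4]] stored as triples and split into two chunks.
chunks = [
    [(0, 0, 1.0), (0, 1, 2.0)],
    [(1, 0, 3.0), (1, 1, 4.0)],
]
v = [10.0, 1.0]
x = matrix_vector_mapreduce(chunks, v)
```

One consequence of this scheme: a row of M with no stored elements produces no key, so its component x_i is simply absent from the output rather than reported as 0.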
If the Vector v Cannot Fit in Main Memory
We can divide the matrix into vertical stripes of equal width and divide
the vector into an equal number of horizontal stripes of the same height,
chosen so that each stripe of the vector fits in main memory.
The ith stripe of the matrix multiplies only components from the ith
stripe of the vector. Thus, we can divide the matrix into one file for
each stripe, and do the same for the vector.
Each Map task is assigned a chunk from one of the stripes of the
matrix and gets the entire corresponding stripe of the vector.
The Map and Reduce tasks can then act exactly as was described
above for the case where Map tasks get the entire vector.
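A minimal sketch of the striped scheme, again simulated in one process. The stripe width, the function names, and the sample data are assumptions made for illustration; the point is that each Map call sees only one stripe of the vector.

```python
from collections import defaultdict

def partition_matrix(triples, width):
    """Group (i, j, m_ij) triples into vertical stripes by column index j."""
    stripes = defaultdict(list)
    for i, j, m_ij in triples:
        stripes[j // width].append((i, j, m_ij))
    return stripes

def partition_vector(v, width):
    """Cut v into consecutive stripes of the given height."""
    count = (len(v) + width - 1) // width
    return {s: v[s * width:(s + 1) * width] for s in range(count)}

def map_stripe(chunk, v_stripe, offset):
    """Same Map logic as before, but indexing into one vector stripe only."""
    for i, j, m_ij in chunk:
        yield i, m_ij * v_stripe[j - offset]

def striped_multiply(triples, v, width):
    m_stripes = partition_matrix(triples, width)
    v_stripes = partition_vector(v, width)
    groups = defaultdict(list)
    for s, chunk in m_stripes.items():
        for key, value in map_stripe(chunk, v_stripes[s], s * width):
            groups[key].append(value)
    return {i: sum(vals) for i, vals in groups.items()}

# A sparse 4 x 4 matrix as triples, multiplied with stripes of width 2.
triples = [(0, 0, 1.0), (0, 3, 2.0), (1, 1, 3.0),
           (2, 2, 4.0), (3, 0, 5.0), (3, 3, 6.0)]
v = [1.0, 2.0, 3.0, 4.0]
x = striped_multiply(triples, v, width=2)
```

The Reduce step is unchanged: contributions to x_i arriving from different stripes share the key i and are summed together.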
Structure of the Web
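The Web has a large strongly connected component (SCC), surrounded by
several other kinds of regions:
•The in-component, consisting of pages that could reach the SCC by
following links, but were not reachable from the SCC.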
•The out-component, consisting of pages reachable from the SCC but
unable to reach the SCC.
•Tendrils, which are of two types. Some tendrils consist of pages
reachable from the in-component but not able to reach the in-
component. The other tendrils can reach the out-component, but
are not reachable from the out-component.
•Tubes, which are pages reachable from the in-component and able
to reach the out-component, but unable to reach the SCC or be
reached from the SCC.
•Isolated components that are unreachable from the large
components (the SCC, in- and out-components) and unable to reach
those components.
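The definitions above can be turned into a small classification sketch. It assumes a seed page already known to lie in the SCC, and it lumps tendrils, tubes, and isolated components together as "other"; the graph, node names, and function names are hypothetical.

```python
from collections import deque

def reachable(graph, start):
    """Breadth-first search: all nodes reachable from start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in graph.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def reverse(graph):
    """Return the graph with every arc reversed."""
    rev = {}
    for u, outs in graph.items():
        rev.setdefault(u, [])
        for w in outs:
            rev.setdefault(w, []).append(u)
    return rev

def bow_tie(graph, seed):
    forward = reachable(graph, seed)             # pages the seed can reach
    backward = reachable(reverse(graph), seed)   # pages that can reach the seed
    scc = forward & backward
    all_nodes = set(graph) | {w for outs in graph.values() for w in outs}
    return {
        "scc": scc,
        "in": backward - scc,
        "out": forward - scc,
        "other": all_nodes - forward - backward,
    }

graph = {
    "s1": ["s2"], "s2": ["s1", "o1"],   # s1 and s2 form the SCC
    "i1": ["s1", "t1"],                 # i1 reaches the SCC; t1 is a tendril
    "t1": [], "o1": [],
    "x1": ["x2"], "x2": [],             # an isolated component
}
parts = bow_tie(graph, "s1")
```

Separating tendrils from tubes and isolated components would take further reachability tests against the in- and out-components.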
PageRank Problems
Dead ends
A page with no out-links is called a dead end.
If we allow dead ends, the transition matrix of the Web is no longer
stochastic, since some of the columns will sum to 0 rather than 1.
A matrix whose column sums are at most 1 is called substochastic.
There are two approaches to dealing with dead ends.
1. We can drop the dead ends from the graph, and also drop their
incoming arcs. Doing so may create more dead ends, which also have
to be dropped, recursively. However, eventually we wind up with a
strongly-connected component, none of whose nodes are dead
ends.
2. We can modify the process by which random surfers are assumed
to move about the Web. This method, which we refer to as
“taxation,” also solves the problem of spider traps.
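The first approach, recursive removal of dead ends, can be sketched as follows. The graph representation (a dict mapping each node to its out-links) and the function name are assumptions made for illustration.

```python
def remove_dead_ends(graph):
    """Repeatedly drop nodes with no out-links, together with their incoming arcs."""
    g = {u: set(outs) for u, outs in graph.items()}
    # Make sure nodes that appear only as link targets are present as keys.
    for outs in list(g.values()):
        for w in outs:
            g.setdefault(w, set())
    while True:
        dead = {u for u, outs in g.items() if not outs}
        if not dead:
            return g
        # Dropping arcs into dead ends may create new dead ends next round.
        g = {u: outs - dead for u, outs in g.items() if u not in dead}

# E is a dead end; once E is removed, C becomes one too.
graph = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["E"],
    "D": ["B", "C"],
    "E": [],
}
pruned = remove_dead_ends(graph)
```

In this assumed graph only A, B, and D survive, and none of them is a dead end.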
Spider Traps and Taxation
•With taxation, each iteration computes the new estimate v′ = βMv + (1 − β)e/n,
where β is a chosen constant, e is a vector of all 1’s, and n is the number
of nodes. We shall use β = 0.8 in this example.
•Notice that we have incorporated the factor β into M by multiplying each
of its elements by 4/5. The components of the vector (1 − β)e/n are each
1/20, since 1 − β = 1/5 and n = 4. Here are the first few iterations.
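A minimal sketch of the taxed iteration with β = 0.8 and n = 4. The matrix M below is an assumption: the example graph is not reproduced in the text, so a hypothetical 4-node graph whose third node links only to itself (a spider trap) stands in for it. The function names are also hypothetical.

```python
def taxed_step(M, v, beta):
    """One iteration of v' = beta * M v + (1 - beta) * e / n."""
    n = len(v)
    tax = (1 - beta) / n          # 1/20 when beta = 0.8 and n = 4
    return [beta * sum(M[i][j] * v[j] for j in range(n)) + tax
            for i in range(n)]

def taxed_pagerank(M, beta, iterations):
    """Start from the uniform vector and apply taxed_step repeatedly."""
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = taxed_step(M, v, beta)
    return v

# Assumed graph: A -> B, C, D; B -> A, D; C -> C (spider trap); D -> B, C.
M = [
    [0,   1/2, 0, 0],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   1, 1/2],
    [1/3, 1/2, 0, 0],
]
first = taxed_step(M, [0.25] * 4, 0.8)
limit = taxed_pagerank(M, 0.8, 100)
```

For this assumed graph the first iteration gives (9/60, 13/60, 25/60, 13/60). The trap node C still collects the largest share in the limit, but the tax keeps it from absorbing all of the PageRank.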