Module 4 MapReduce and Link Analysis
Uploaded by Aleesha K B

MapReduce and Link Analysis
ANISHA JOSE PH
Web as a Graph
PageRank
A page gets a high score not only because it has many in-links, but also because those in-links come from pages that are themselves of high worth.
Eigenvector Formulation
•Here λ is 1.
•A square matrix A is stochastic if all of its entries are nonnegative and the entries of each column sum to 1.
•Power iteration is an efficient method to find the principal eigenvector of M.


Definition of PageRank- Example
The transition matrix for the Web
Definition of PageRank
This sort of behavior is an example of the ancient theory of Markov
processes. It is known that the distribution of the surfer approaches
a limiting distribution r that satisfies r = Mr, provided two conditions
are met:
1. The graph is strongly connected; that is, it is possible to get from
any node to any other node.
2. There are no dead ends: nodes that have no arcs out.
Note: A directed graph is strongly connected if there is a path between every pair of vertices, in both directions.
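The power iteration r = Mr described above can be sketched as follows; the three-page graph here is an assumption for illustration, not the example from the slides.

```python
import numpy as np

# A small illustrative column-stochastic transition matrix M: page j's
# column distributes the surfer equally over j's out-links. This 3-page
# graph (A->B,C; B->A,C; C->A) is assumed for illustration.
M = np.array([
    [0.0, 0.5, 1.0],   # A receives from B and C
    [0.5, 0.0, 0.0],   # B receives from A
    [0.5, 0.5, 0.0],   # C receives from A and B
])

def pagerank(M, tol=1e-10, max_iter=1000):
    """Power iteration: repeatedly apply r <- M r until convergence."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)        # start from the uniform distribution
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

r = pagerank(M)
# r approximates the limiting distribution: r = M r, components sum to 1
```

The graph is strongly connected and has no dead ends, so both conditions for convergence to a limiting distribution are met.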
PageRank Iteration Using MapReduce
Each PageRank iteration is a matrix-vector multiplication, so we can use the MapReduce functions discussed earlier.
Matrix-Vector Multiplication by MapReduce
Suppose we have an n×n matrix M, whose element in row i and column j is denoted mij, and a vector v of length n whose jth element is vj. If n is small, say n = 100, we do not want to use a DFS or MapReduce for this calculation; the method pays off only when n is very large.
n is large, but vector v can fit in main memory and thus be available to every
Map task.
The matrix M and the vector v each will be stored in a file of the DFS.
We assume that the row-column coordinates of each matrix element are discoverable, either from its position in the file or because it is stored with explicit coordinates, as a triple (i, j, mij).
The Map Function:
Each Map task operates on a chunk of the matrix M. From each matrix element mij it produces the key-value pair (i, mij·vj). Thus, all terms of the sum that make up component xi of the matrix-vector product get the same key, i.

The Reduce Function: The Reduce function simply sums all the
values associated with a given key i. The result will be a pair (i, xi).
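The Map and Reduce functions above can be sketched in a few lines; the tiny matrix and vector here are hypothetical, chosen only so the pairs are easy to trace.

```python
from collections import defaultdict

# Hypothetical tiny instance: M stored as triples (i, j, m_ij); v fits in memory.
triples = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [10.0, 20.0]

def map_fn(i, j, m_ij):
    # from matrix element m_ij, produce the key-value pair (i, m_ij * v_j)
    return (i, m_ij * v[j])

def reduce_fn(pairs):
    # sum all values associated with a given key i, yielding (i, x_i)
    x = defaultdict(float)
    for i, value in pairs:
        x[i] += value
    return dict(x)

x = reduce_fn(map_fn(i, j, m) for (i, j, m) in triples)
# x[i] is component i of the product Mv
```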
If the Vector v Cannot Fit in Main Memory
We can divide the matrix into vertical stripes of equal width and divide the vector into the same number of horizontal stripes of equal height. The ith stripe of the matrix multiplies only components from the ith stripe of the vector. Thus, we can store the matrix in one file per stripe, and do the same for the vector.
Each Map task is assigned a chunk from one of the stripes of the matrix and gets the entire corresponding stripe of the vector. The Map and Reduce tasks can then act exactly as described above for the case where Map tasks get the entire vector.
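The striping scheme can be sketched as below; the 4×4 matrix, the stripe count, and the values are all assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

n, k = 4, 2                                # n x n matrix, k vertical stripes
M = np.arange(1.0, 17.0).reshape(n, n)     # hypothetical matrix values 1..16
v = np.array([1.0, 2.0, 3.0, 4.0])

partials = defaultdict(float)
for s in range(k):                         # one "Map task" per stripe
    cols = list(range(s * n // k, (s + 1) * n // k))
    v_stripe = v[cols]                     # only stripe s of v is shipped to this task
    for i in range(n):
        for idx, j in enumerate(cols):
            partials[i] += M[i, j] * v_stripe[idx]   # emit (i, m_ij * v_j)

x = [partials[i] for i in range(n)]        # Reduce: sum values sharing key i
# x equals the full product M @ v, even though no task saw all of v
```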
Structure of the Web
The Web contains one large strongly connected component (SCC), together with several other classes of pages:
•The in-component, consisting of pages that can reach the SCC by following links, but are not reachable from the SCC.
•The out-component, consisting of pages reachable from the SCC but
unable to reach the SCC.
•Tendrils, which are of two types. Some tendrils consist of pages reachable from the in-component but not able to reach the in-component. The other tendrils can reach the out-component but are not reachable from the out-component.
•Tubes, which are pages reachable from the in-component and able to reach the out-component, but unable to reach the SCC or be reached from the SCC.
•Isolated components that are unreachable from the large components (the SCC, in-, and out-components) and unable to reach those components.
PageRank Problems
Dead ends
If we allow dead ends, the transition matrix of the Web is no longer
stochastic, since some of the columns will sum to 0 rather than 1.
A matrix whose column sums are at most 1 is called substochastic.
There are two approaches to dealing with dead ends.
1. We can drop the dead ends from the graph, and also drop their
incoming arcs. Doing so may create more dead ends, which also have
to be dropped, recursively. However, eventually we wind up with a
strongly-connected component, none of whose nodes are dead
ends.
2. We can modify the process by which random surfers are assumed to move about the Web. This method, which we refer to as "taxation," also solves the problem of spider traps.
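Approach 1, recursively dropping dead ends and their incoming arcs, can be sketched as follows; the dict-of-sets graph representation is an assumption.

```python
def drop_dead_ends(graph):
    """Recursively remove dead ends (nodes with no out-arcs) along with
    their incoming arcs, until no dead ends remain.
    graph: dict mapping node -> set of successor nodes."""
    g = {u: set(vs) for u, vs in graph.items()}
    while True:
        dead = [u for u, vs in g.items() if not vs]
        if not dead:
            return g
        for u in dead:
            del g[u]                       # drop the dead end itself
        for vs in g.values():
            vs.difference_update(dead)     # drop arcs into the dead ends

# Cascade: removing C makes B a dead end, which in turn makes A one,
# so this chain graph empties out entirely.
g = drop_dead_ends({'A': {'B'}, 'B': {'C'}, 'C': set()})
```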
Example
Spider Traps and Taxation
To avoid spider traps, we modify the iteration to
v′ = βMv + (1 − β)e/n
where β is a chosen constant (usually in the range 0.8 to 0.9), e is a vector of all 1s, and n is the number of nodes.
•We shall use β = 0.8 in this example.
•Notice that we have incorporated the factor β into M by multiplying each of its elements by 4/5. The components of the vector (1 − β)e/n are each 1/20, since 1 − β = 1/5 and n = 4. Here are the first few iterations.
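The taxation iteration v′ = βMv + (1 − β)e/n can be sketched as below; the four-page link structure, in which C is a spider trap linking only to itself, is an assumption for illustration.

```python
import numpy as np

beta = 0.8
# Hypothetical column-stochastic matrix for pages A, B, C, D, assuming
# links A->B,C,D; B->A,D; C->C (spider trap); D->B,C.
M = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [1/3, 0.0, 0.0, 0.5],
    [1/3, 0.0, 1.0, 0.5],
    [1/3, 0.5, 0.0, 0.0],
])
n = 4
v = np.full(n, 1.0 / n)                  # start from the uniform distribution
for _ in range(100):
    v = beta * (M @ v) + (1 - beta) / n  # v' = beta*M*v + (1-beta)e/n
# C still gets the largest share, but the tax keeps it from absorbing
# all of the PageRank as it would under the pure iteration v' = Mv
```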

Topic-Sensitive PageRank
•The mathematical formulation for the iteration that yields topic-sensitive PageRank is similar to the equation we used for general PageRank.
•The only difference is how we add the new surfers.
•Suppose S is a set of integers consisting of the row/column numbers for the pages we have identified as belonging to a certain topic (called the teleport set).
•Let eS be a vector that has 1 in the components in S and 0 in the other components. Then the topic-sensitive PageRank for S is the limit of the iteration
v′ = βMv + (1 − β)eS/|S|
Example: without teleporting; with teleporting (standard PageRank); page-specific rank; topic-specific PageRank (parts 1-3).
•Suppose that our topic is represented by the teleport set S = {B,D}.
Then the vector (1 − β)eS/|S| has 1/10 for its second and fourth
components and 0 for the other two components.
•The reason is that 1 − β = 1/5, the size of S is 2, and eS has 1 in the
components for B and D and 0 in the components for A and C.
•Thus, the equation that must be iterated is
v′ = βMv + (1 − β)eS/|S|, with β = 0.8 and (1 − β)eS/|S| = (0, 1/10, 0, 1/10).
•Here are the first few iterations of this equation.
•We have also started with the surfers only at the pages in the
teleport set. Although the initial distribution has no effect on the
limit, it may help the computation to converge faster.
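The topic-sensitive iteration with teleport set S = {B, D} can be sketched as follows; the four-page link structure (A->B,C,D; B->A,D; C->A; D->B,C) is assumed for illustration, since the slides' figure is not reproduced here.

```python
import numpy as np

beta = 0.8
# Hypothetical column-stochastic matrix for pages A, B, C, D (indices 0-3),
# assuming links A->B,C,D; B->A,D; C->A; D->B,C.
M = np.array([
    [0.0, 0.5, 1.0, 0.0],
    [1/3, 0.0, 0.0, 0.5],
    [1/3, 0.0, 0.0, 0.5],
    [1/3, 0.5, 0.0, 0.0],
])
n = 4
S = [1, 3]                     # teleport set {B, D}
e_S = np.zeros(n)
e_S[S] = 1.0                   # 1 in the components for B and D, else 0
v = e_S / len(S)               # start the surfers at the teleport set only
for _ in range(100):
    v = beta * (M @ v) + (1 - beta) * e_S / len(S)
# pages in the teleport set (and pages they link to) get boosted rank
```

Starting from the teleport set does not change the limit, but, as the slides note, it can help the computation converge faster.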
On-Line Algorithms
•Before addressing the question of matching advertisements to search
queries, we shall digress slightly by examining the general class to which
such algorithms belong.
•This class is referred to as "on-line," and such algorithms generally involve a "greedy" approach.
•By contrast, an algorithm that is given all the data it needs at the start, may access the data in any order, and produces its answer at the end is called off-line.
On-Line Algorithms
•There is an extreme form of stream processing, where we must
respond with an output after each stream element arrives.
•We thus must decide about each stream element knowing
nothing at all of the future. Algorithms of this class are called on-line
algorithms.
The Matching Problem
•Consider the problem of matching ads to search queries.
•An abstraction of it, called "maximal matching," involves bipartite graphs (graphs with two sets of nodes, left and right, with every edge connecting a node in the left set to a node in the right set).
•a can be matched with 1 and 4, but we choose (1, a).
•b can be matched with 2 and 3, but we choose (2, b).
•c can be matched only with 1, which is already matched, so we cannot choose (1, c).
•d can be matched with 3, so we choose (3, d).
•The result is a maximal matching but not a perfect (or best possible) matching, so this greedy approach to the matching problem is a heuristic.
Greedy Algorithm for Maximal Matching
•In particular, the greedy algorithm for maximal matching works as
follows.
•We consider the edges in whatever order they are given. When we consider (x, y), we add this edge to the matching if neither x nor y is an end of any edge selected for the matching so far. Otherwise, we skip (x, y).
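The greedy rule above can be sketched directly; the particular edge order below is an assumption, chosen to reproduce the choices made in the slides' example.

```python
def greedy_matching(edges):
    """Consider edges in the given order; add (x, y) to the matching
    only if neither endpoint is already matched."""
    matched_left, matched_right, matching = set(), set(), []
    for x, y in edges:
        if x not in matched_left and y not in matched_right:
            matching.append((x, y))
            matched_left.add(x)
            matched_right.add(y)
    return matching

# Assumed edge order for the slides' example: left nodes 1-4, right nodes a-d.
edges = [(1, 'a'), (2, 'b'), (1, 'c'), (3, 'd')]
m = greedy_matching(edges)
# (1, c) is skipped because node 1 is already matched; m has 3 edges
```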
Competitive Ratio
The competitive ratio of an on-line algorithm is defined as the worst-case ratio between the value of the solution produced by the on-line algorithm and the value of an optimal solution, over all possible inputs.
•Here, the cardinality of the greedy matching is 3 and the cardinality of the optimal matching is 4, so the ratio for this input is 3/4.
•In general, every edge of the optimal matching that is missing from the greedy matching must share an endpoint with some edge the greedy algorithm did select, so |Mopt| ≤ 2|Mgreedy|.
•I.e., the competitive ratio of the greedy algorithm is at least 1/2.
•A worst-case input shows the bound is tight: Mopt = {(1, c), (2, d), (3, b), (4, a)} has four edges, while Mgreedy = {(1, a), (2, b)} has only two, giving a ratio of exactly 1/2.
•If advertiser A has the highest bid, the search engine may use A; sorted by bid alone, the order is A, B, C.
•Google observed that sorting by expected revenue (bid × click-through rate) gives more revenue; in that order the ads become B, C, A.
•Among the set of ads for a search query, a new ad may have a lower CTR, since its true CTR is not yet known.
•The Balance algorithm achieves a competitive ratio of 3/4, which is better than 1/2.
