
CS345

Data Mining

Link Analysis Algorithms


Page Rank

Anand Rajaraman, Jeffrey D. Ullman


Link Analysis Algorithms
 Page Rank
 Hubs and Authorities
 Topic-Specific Page Rank
 Spam Detection Algorithms
 Other interesting topics we won’t cover
 Detecting duplicates and mirrors
 Mining for communities
 Classification
 Spectral clustering
Ranking web pages
 Web pages are not equally “important”
 www.joe-schmoe.com vs. www.stanford.edu
 Inlinks as votes
 www.stanford.edu has 23,400 inlinks
 www.joe-schmoe.com has 1 inlink
 Are all inlinks equal?
 Recursive question!
Simple recursive formulation
 Each link’s vote is proportional to the
importance of its source page
 If page P with importance x has n
outlinks, each link gets x/n votes
 Page P’s own importance is the sum of
the votes on its inlinks
Simple “flow” model
The web in 1839: Yahoo links to itself and to Amazon, Amazon links to Yahoo and to M’soft, and M’soft links back to Amazon. Each page splits its importance equally among its outlinks (y/2, a/2, …), giving the flow equations:

  y = y/2 + a/2
  a = y/2 + m
  m = a/2
Solving the flow equations
 3 equations, 3 unknowns, no constants
 No unique solution
 All solutions equivalent modulo scale factor
 Additional constraint forces uniqueness
 y+a+m = 1
 y = 2/5, a = 2/5, m = 1/5
 Gaussian elimination method works for
small examples, but we need a better
method for large graphs
Matrix formulation
 Matrix M has one row and one column for each
web page
 Suppose page j has n outlinks
 If j → i, then Mij = 1/n
 Else Mij=0
 M is a column stochastic matrix
 Columns sum to 1
 Suppose r is a vector with one entry per web
page
 ri is the importance score of page i
 Call it the rank vector
 |r| = 1
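As a minimal sketch (not from the slides; the function name and link-list format are illustrative), the column-stochastic matrix M can be built from a list of (source, destination) links:

```python
# Build the column-stochastic matrix M from directed links (j -> i pairs).
def build_M(links, N):
    out = [0] * N
    for j, i in links:
        out[j] += 1                      # outdegree of each source page
    M = [[0.0] * N for _ in range(N)]
    for j, i in links:
        M[i][j] = 1.0 / out[j]          # M[i][j] = 1/n when j -> i
    return M

# Yahoo(0) -> Yahoo, Amazon; Amazon(1) -> Yahoo, M'soft; M'soft(2) -> Amazon
M = build_M([(0, 0), (0, 1), (1, 0), (1, 2), (2, 1)], 3)
# Every column of M sums to 1
```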
Example
Suppose page j links to 3 pages, including i. Then column j of M has the entry 1/3 in row i, so in the product r = Mr, page i receives a contribution of rj/3 from page j.

[figure: row i of M, with entry 1/3 in column j, multiplying the rank vector r]
Eigenvector formulation
 The flow equations can be written
r = Mr
 So the rank vector is an eigenvector of
the stochastic web matrix
 In fact, it is the first or principal eigenvector, with
corresponding eigenvalue 1
Example
For the Yahoo/Amazon/M’soft graph:

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    1
  m    0   1/2   0

r = Mr is exactly the flow equations:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2
Power Iteration method
 Simple iterative scheme (aka relaxation)
 Suppose there are N web pages
 Initialize: r0 = [1/N, …, 1/N]T
 Iterate: rk+1 = Mrk
 Stop when |rk+1 - rk|1 < ε
 |x|1 = Σ1≤i≤N |xi| is the L1 norm
 Can use any other vector norm e.g.,
Euclidean
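The scheme above can be sketched in a few lines of Python (dense matrix, pure Python; the function name is illustrative), using the three-page example:

```python
# Power iteration r_{k+1} = M r_k on the Yahoo/Amazon/M'soft example.
# M is column-stochastic: M[i][j] = 1/n if page j (with n outlinks) links to i.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

def power_iterate(M, eps=1e-12):
    N = len(M)
    r = [1.0 / N] * N                          # r0 = [1/N, ..., 1/N]
    while True:
        r_next = [sum(M[i][j] * r[j] for j in range(N)) for i in range(N)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:   # L1 norm
            return r_next
        r = r_next

r = power_iterate(M)   # converges to y = 2/5, a = 2/5, m = 1/5
```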
Power Iteration Example
The same Yahoo/Amazon/M’soft graph:

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    1
  m    0   1/2   0

  y      1/3   1/3   5/12   3/8          2/5
  a  =   1/3   1/2   1/3    11/24  ...   2/5
  m      1/3   1/6   1/4    1/6          1/5
Random Walk Interpretation
 Imagine a random web surfer
 At any time t, surfer is on some page P
 At time t+1, the surfer follows an outlink
from P uniformly at random
 Ends up on some page Q linked from P
 Process repeats indefinitely
 Let p(t) be a vector whose ith component
is the probability that the surfer is at
page i at time t
 p(t) is a probability distribution on pages
The stationary distribution
 Where is the surfer at time t+1?
 Follows a link uniformly at random
 p(t+1) = Mp(t)
 Suppose the random walk reaches a
state such that p(t+1) = Mp(t) = p(t)
 Then p(t) is called a stationary distribution
for the random walk
 Our rank vector r satisfies r = Mr
 So it is a stationary distribution for the
random surfer
Existence and Uniqueness
A central result from the theory of random walks (aka Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0.
Spider traps
 A group of pages is a spider trap if there
are no links from within the group to
outside the group
 Random surfer gets trapped
 Spider traps violate the conditions
needed for the random walk theorem
Microsoft becomes a spider trap

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    0
  m    0   1/2   1

  y      1    1     3/4   5/8        0
  a  =   1   1/2    1/2   3/8   ...  0
  m      1   3/2    7/4    2         3
Random teleports
 The Google solution for spider traps
 At each time step, the random surfer
has two options:
 With probability , follow a link at random
 With probability 1-, jump to some page
uniformly at random
 Common values for  are in the range 0.8 to
0.9
 Surfer will teleport out of spider trap
within a few time steps
Random teleports ()
0.2*1/3 y y y
1/2
Yahoo 0.8*1/2 y 1/2 1/2 1/3
a 1/2 0.8* 1/2 + 0.2* 1/3
1/2 m 0 0 1/3
0.8*1/2 0.2*1/3
0.2*1/3
1/2 1/2 0 1/3 1/3 1/3
Amazon M’soft 0.8 1/2 0 0 + 0.2 1/3 1/3 1/3
0 1/2 1 1/3 1/3 1/3

y 7/15 7/15 1/15


a 7/15 1/15 1/15
m 1/15 7/15 13/15
Random teleports ()
1/2 1/2 0 1/3 1/3 1/3
Yahoo 0.8 1/2 0 0 + 0.2 1/3 1/3 1/3
0 1/2 1 1/3 1/3 1/3

y 7/15 7/15 1/15


a 7/15 1/15 1/15
m 1/15 7/15 13/15
Amazon M’soft

y 1 1.00 0.84 0.776 7/11


a = 1 0.60 0.60 0.536 . . . 5/11
m 1 1.40 1.56 1.688 21/11
Matrix formulation
 Suppose there are N pages
 Consider a page j, with set of outlinks O(j)
 We have Mij = 1/|O(j)| when j→i and Mij = 0
otherwise
 The random teleport is equivalent to:
 adding a teleport link from j to every other
page with probability (1-β)/N
 reducing the probability of following each
outlink from 1/|O(j)| to β/|O(j)|
 Equivalent: tax each page a fraction (1-β)
of its score and redistribute it evenly
Page Rank
 Construct the N×N matrix A as follows
 Aij = βMij + (1-β)/N
 Verify that A is a stochastic matrix
 The page rank vector r is the principal
eigenvector of this matrix
 satisfying r = Ar
 Equivalently, r is the stationary
distribution of the random walk with
teleports
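A minimal sketch of this construction (pure Python, β = 0.8; not from the slides), applied to the spider-trap example:

```python
# A = beta*M + (1-beta)/N on the spider-trap example (M'soft links only to itself).
beta = 0.8
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
N = len(M)
A = [[beta * M[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]
# Every column of A sums to 1, so A is column-stochastic.

r = [1.0 / N] * N
for _ in range(200):          # plenty of iterations for this tiny graph
    r = [sum(A[i][j] * r[j] for j in range(N)) for i in range(N)]
# r approaches (7/33, 5/33, 21/33): the trap no longer absorbs all the rank
```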
Dead ends
 Pages with no outlinks are “dead ends”
for the random surfer
 Nowhere to go on next step
Microsoft becomes a dead end

       1/2 1/2 0         1/3 1/3 1/3     7/15  7/15  1/15
  0.8  1/2  0  0  + 0.2  1/3 1/3 1/3  =  7/15  1/15  1/15
        0  1/2 0         1/3 1/3 1/3     1/15  7/15  1/15

Non-stochastic! (the m column sums to only 3/15)

  y      1   1     0.787   0.648         0
  a  =   1   0.6   0.547   0.430   ...   0
  m      1   0.6   0.387   0.333         0
Dealing with dead-ends
 Teleport
 Follow random teleport links with probability
1.0 from dead-ends
 Adjust matrix accordingly
 Prune and propagate
 Preprocess the graph to eliminate dead-ends
 Might require multiple passes
 Compute page rank on reduced graph
 Approximate values for dead ends by
propagating values from the reduced graph
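The teleport option above (follow teleport links with probability 1.0 from dead ends) amounts to replacing each all-zero column of M with 1/N in every row. A minimal sketch (function name illustrative):

```python
# Dead-end fix by teleporting: from a page with no outlinks, jump uniformly
# at random, i.e. replace each all-zero column of M with 1/N in every row.
def fix_dead_ends(M):
    N = len(M)
    for j in range(N):
        if all(M[i][j] == 0 for i in range(N)):   # page j is a dead end
            for i in range(N):
                M[i][j] = 1.0 / N
    return M

# M'soft as a dead end (its column is all zeros):
M = fix_dead_ends([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 0.0]])
# The last column becomes [1/3, 1/3, 1/3], so every column now sums to 1
```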
Computing page rank
 Key step is matrix-vector multiplication
 rnew = Arold
 Easy if we have enough main memory to
hold A, rold, rnew
 Say N = 1 billion pages
 We need 4 bytes for each entry (say)
 2 billion entries for vectors, approx 8GB
 Matrix A has N^2 entries
 10^18 is a large number!
Rearranging the equation
r = Ar, where Aij = βMij + (1-β)/N

  ri = Σ1≤j≤N Aij rj
  ri = Σ1≤j≤N [βMij + (1-β)/N] rj
     = β Σ1≤j≤N Mij rj + (1-β)/N Σ1≤j≤N rj
     = β Σ1≤j≤N Mij rj + (1-β)/N, since |r| = 1

r = βMr + [(1-β)/N]N
where [x]N is an N-vector with all entries x
Sparse matrix formulation
 We can rearrange the page rank equation:
 r = βMr + [(1-β)/N]N
 [(1-β)/N]N is an N-vector with all entries (1-β)/N
 M is a sparse matrix!
 10 links per node, approx 10N entries
 So in each iteration, we need to:
 Compute rnew = βMrold
 Add a constant value (1-β)/N to each entry in rnew
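One iteration of this formulation touches only the nonzero entries of M. A sketch using adjacency lists (pure Python, in-memory; the link-list format is illustrative):

```python
# Iterate r_new = beta*M*r_old + (1-beta)/N using adjacency lists.
# links[j] lists the pages that page j links to (the nonzeros of column j).
beta = 0.8
links = {0: [0, 1], 1: [0, 2], 2: [1]}   # Yahoo(0), Amazon(1), M'soft(2)
N = 3

r = [1.0 / N] * N
for _ in range(100):
    r_new = [(1 - beta) / N] * N          # the constant teleport term
    for j, dests in links.items():
        share = beta * r[j] / len(dests)  # each outlink carries beta/|O(j)| of r[j]
        for i in dests:
            r_new[i] += share
    r = r_new
# With no dead ends, |r| stays 1 throughout
```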
Sparse matrix encoding
 Encode sparse matrix using only
nonzero entries
 Space proportional roughly to number of
links
 say 10N, or 4*10*1 billion = 40GB
 still won’t fit in memory, but will fit on disk
  source node | degree | destination nodes
       0      |   3    | 1, 5, 7
       1      |   5    | 17, 64, 113, 117, 245
       2      |   2    | 13, 23
Basic Algorithm
 Assume we have enough RAM to fit rnew, plus
some working memory
 Store rold and matrix M on disk

Basic Algorithm:
 Initialize: rold = [1/N]N
 Iterate:
 Update: Perform a sequential scan of M and rold to
update rnew
 Write out rnew to disk as rold for next iteration
 Every few iterations, compute |rnew-rold| and stop if it
is below threshold
 Need to read both vectors into memory
Update step

Initialize all entries of rnew to (1-β)/N
For each page p (out-degree n):
  Read into memory: p, n, dest1, …, destn, rold(p)
  for j = 1..n:
    rnew(destj) += β*rold(p)/n

  src | degree | destination nodes
   0  |   3    | 1, 5, 6
   1  |   4    | 17, 64, 113, 117
   2  |   2    | 13, 23

[figure: rnew and rold shown as arrays; each scanned row of M adds β*rold(src)/degree to the rnew entries it lists]
Analysis
 In each iteration, we have to:
 Read rold and M
 Write rnew back to disk
 IO Cost = 2|r| + |M|
 What if we had enough memory to fit
both rnew and rold?
 What if we could not even fit rnew in
memory?
 10 billion pages
Block-based update algorithm

  src | degree | destination nodes
   0  |   4    | 0, 1, 3, 5
   1  |   2    | 0, 5
   2  |   2    | 3, 4

[figure: rnew (indexed 0..5) is filled one block at a time; rold (indexed 0..5) and M are scanned once per block]
Analysis of Block Update
 Similar to nested-loop join in databases
 Break rnew into k blocks that fit in memory
 Scan M and rold once for each block
 k scans of M and rold
 k(|M| + |r|) + |r| = k|M| + (k+1)|r|
 Can we do better?
 Hint: M is much bigger than r (approx
10-20x), so we must avoid reading it k
times per iteration
Block-Stripe Update algorithm

Stripe for rnew block {0, 1}:
  src | degree | destinations in stripe
   0  |   4    | 0, 1
   1  |   3    | 0
   2  |   2    | 1

Stripe for rnew block {2, 3}:
  src | degree | destinations in stripe
   0  |   4    | 3
   2  |   2    | 3

Stripe for rnew block {4, 5}:
  src | degree | destinations in stripe
   0  |   4    | 5
   1  |   3    | 5
   2  |   2    | 4
Block-Stripe Analysis
 Break M into stripes
 Each stripe contains only destination nodes
in the corresponding block of rnew
 Some additional overhead per stripe
 But usually worth it
 Cost per iteration
 |M|(1+) + (k+1)|r|
Next
 Topic-Specific Page Rank
 Hubs and Authorities
 Spam Detection
