
CS345

Data Mining

Link Analysis Algorithms


Page Rank

Anand Rajaraman, Jeffrey D. Ullman


Link Analysis Algorithms
 Page Rank
 Hubs and Authorities
 Topic-Specific Page Rank
 Spam Detection Algorithms
 Other interesting topics we won’t cover
 Detecting duplicates and mirrors
 Mining for communities
 Classification
 Spectral clustering
Ranking web pages
 Web pages are not equally “important”
 www.joe-schmoe.com vs. www.stanford.edu
 Inlinks as votes
 www.stanford.edu has 23,400 inlinks
 www.joe-schmoe.com has 1 inlink
 Are all inlinks equal?
 Recursive question!
Simple recursive formulation
 Each link’s vote is proportional to the
importance of its source page
 If page P with importance x has n
outlinks, each link gets x/n votes
 Page P’s own importance is the sum of
the votes on its inlinks
Simple “flow” model
The web in 1839: Yahoo links to itself and to Amazon, Amazon links to Yahoo and to M’soft, and M’soft links back to Amazon. Each page splits its importance equally among its outlinks (y/2, a/2, …), giving the flow equations:

  y = y/2 + a/2
  a = y/2 + m
  m = a/2
Solving the flow equations
 3 equations, 3 unknowns, no constants
 No unique solution
 All solutions equivalent modulo scale factor
 Additional constraint forces uniqueness
 y+a+m = 1
 y = 2/5, a = 2/5, m = 1/5
 Gaussian elimination method works for
small examples, but we need a better
method for large graphs
Matrix formulation
 Matrix M has one row and one column for each
web page
 Suppose page j has n outlinks
 If j → i, then Mij = 1/n
 Else Mij=0
 M is a column stochastic matrix
 Columns sum to 1
 Suppose r is a vector with one entry per web
page
 ri is the importance score of page i
 Call it the rank vector
 |r| = 1
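As a minimal sketch (not from the slides; the function name and link-list format are illustrative), the column-stochastic matrix M can be built from a list of (source, destination) links:

```python
# Build the column-stochastic matrix M from directed links (j -> i pairs).
def build_M(links, N):
    out = [0] * N
    for j, i in links:
        out[j] += 1                      # outdegree of each source page
    M = [[0.0] * N for _ in range(N)]
    for j, i in links:
        M[i][j] = 1.0 / out[j]          # M[i][j] = 1/n when j -> i
    return M

# Yahoo(0) -> Yahoo, Amazon; Amazon(1) -> Yahoo, M'soft; M'soft(2) -> Amazon
M = build_M([(0, 0), (0, 1), (1, 0), (1, 2), (2, 1)], 3)
# Every column of M sums to 1
```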
Example
Suppose page j links to 3 pages, including i. Then column j of M has the entry 1/3 in row i, so in the product r = Mr, page i receives a contribution of rj/3 from page j.

[figure: row i of M, with entry 1/3 in column j, multiplying the rank vector r]
Eigenvector formulation
 The flow equations can be written
r = Mr
 So the rank vector is an eigenvector of
the stochastic web matrix
 In fact, it is the first or principal eigenvector, with
corresponding eigenvalue 1
Example
For the Yahoo/Amazon/M’soft graph:

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    1
  m    0   1/2   0

r = Mr is exactly the flow equations:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2
Power Iteration method
 Simple iterative scheme (aka relaxation)
 Suppose there are N web pages
 Initialize: r0 = [1/N, …, 1/N]T
 Iterate: rk+1 = Mrk
 Stop when |rk+1 - rk|1 < ε
 |x|1 = Σ1≤i≤N |xi| is the L1 norm
 Can use any other vector norm e.g.,
Euclidean
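The scheme above can be sketched in a few lines of Python (dense matrix, pure Python; the function name is illustrative), using the three-page example:

```python
# Power iteration r_{k+1} = M r_k on the Yahoo/Amazon/M'soft example.
# M is column-stochastic: M[i][j] = 1/n if page j (with n outlinks) links to i.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

def power_iterate(M, eps=1e-12):
    N = len(M)
    r = [1.0 / N] * N                          # r0 = [1/N, ..., 1/N]
    while True:
        r_next = [sum(M[i][j] * r[j] for j in range(N)) for i in range(N)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:   # L1 norm
            return r_next
        r = r_next

r = power_iterate(M)   # converges to y = 2/5, a = 2/5, m = 1/5
```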
Power Iteration Example
The same Yahoo/Amazon/M’soft graph:

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    1
  m    0   1/2   0

  y      1/3   1/3   5/12   3/8          2/5
  a  =   1/3   1/2   1/3    11/24  ...   2/5
  m      1/3   1/6   1/4    1/6          1/5
Random Walk Interpretation
 Imagine a random web surfer
 At any time t, surfer is on some page P
 At time t+1, the surfer follows an outlink
from P uniformly at random
 Ends up on some page Q linked from P
 Process repeats indefinitely
 Let p(t) be a vector whose ith component
is the probability that the surfer is at
page i at time t
 p(t) is a probability distribution on pages
The stationary distribution
 Where is the surfer at time t+1?
 Follows a link uniformly at random
 p(t+1) = Mp(t)
 Suppose the random walk reaches a
state such that p(t+1) = Mp(t) = p(t)
 Then p(t) is called a stationary distribution
for the random walk
 Our rank vector r satisfies r = Mr
 So it is a stationary distribution for the
random surfer
Existence and Uniqueness
A central result from the theory of random walks (aka Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0.
Spider traps
 A group of pages is a spider trap if there
are no links from within the group to
outside the group
 Random surfer gets trapped
 Spider traps violate the conditions
needed for the random walk theorem
Microsoft becomes a spider trap

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    0
  m    0   1/2   1

  y      1    1     3/4   5/8        0
  a  =   1   1/2    1/2   3/8   ...  0
  m      1   3/2    7/4    2         3
Random teleports
 The Google solution for spider traps
 At each time step, the random surfer
has two options:
 With probability , follow a link at random
 With probability 1-, jump to some page
uniformly at random
 Common values for  are in the range 0.8 to
0.9
 Surfer will teleport out of spider trap
within a few time steps
Random teleports ()
0.2*1/3 y y y
1/2
Yahoo 0.8*1/2 y 1/2 1/2 1/3
a 1/2 0.8* 1/2 + 0.2* 1/3
1/2 m 0 0 1/3
0.8*1/2 0.2*1/3
0.2*1/3
1/2 1/2 0 1/3 1/3 1/3
Amazon M’soft 0.8 1/2 0 0 + 0.2 1/3 1/3 1/3
0 1/2 1 1/3 1/3 1/3

y 7/15 7/15 1/15


a 7/15 1/15 1/15
m 1/15 7/15 13/15
Random teleports ()
1/2 1/2 0 1/3 1/3 1/3
Yahoo 0.8 1/2 0 0 + 0.2 1/3 1/3 1/3
0 1/2 1 1/3 1/3 1/3

y 7/15 7/15 1/15


a 7/15 1/15 1/15
m 1/15 7/15 13/15
Amazon M’soft

y 1 1.00 0.84 0.776 7/11


a = 1 0.60 0.60 0.536 . . . 5/11
m 1 1.40 1.56 1.688 21/11
Matrix formulation
 Suppose there are N pages
 Consider a page j, with set of outlinks O(j)
 We have Mij = 1/|O(j)| when j→i and Mij = 0
otherwise
 The random teleport is equivalent to:
 adding a teleport link from j to every other
page with probability (1-β)/N
 reducing the probability of following each
outlink from 1/|O(j)| to β/|O(j)|
 Equivalent: tax each page a fraction (1-β)
of its score and redistribute it evenly
Page Rank
 Construct the N×N matrix A as follows
 Aij = βMij + (1-β)/N
 Verify that A is a stochastic matrix
 The page rank vector r is the principal
eigenvector of this matrix
 satisfying r = Ar
 Equivalently, r is the stationary
distribution of the random walk with
teleports
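A minimal sketch of this construction (pure Python, β = 0.8; not from the slides), applied to the spider-trap example:

```python
# A = beta*M + (1-beta)/N on the spider-trap example (M'soft links only to itself).
beta = 0.8
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
N = len(M)
A = [[beta * M[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]
# Every column of A sums to 1, so A is column-stochastic.

r = [1.0 / N] * N
for _ in range(200):          # plenty of iterations for this tiny graph
    r = [sum(A[i][j] * r[j] for j in range(N)) for i in range(N)]
# r approaches (7/33, 5/33, 21/33): the trap no longer absorbs all the rank
```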
Dead ends
 Pages with no outlinks are “dead ends”
for the random surfer
 Nowhere to go on next step
Microsoft becomes a dead end

       1/2 1/2 0         1/3 1/3 1/3     7/15  7/15  1/15
  0.8  1/2  0  0  + 0.2  1/3 1/3 1/3  =  7/15  1/15  1/15
        0  1/2 0         1/3 1/3 1/3     1/15  7/15  1/15

Non-stochastic! (the m column sums to only 3/15)

  y      1   1     0.787   0.648         0
  a  =   1   0.6   0.547   0.430   ...   0
  m      1   0.6   0.387   0.333         0
Dealing with dead-ends
 Teleport
 Follow random teleport links with probability
1.0 from dead-ends
 Adjust matrix accordingly
 Prune and propagate
 Preprocess the graph to eliminate dead-ends
 Might require multiple passes
 Compute page rank on reduced graph
 Approximate values for dead ends by
propagating values from the reduced graph
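The teleport option above (follow teleport links with probability 1.0 from dead ends) amounts to replacing each all-zero column of M with 1/N in every row. A minimal sketch (function name illustrative):

```python
# Dead-end fix by teleporting: from a page with no outlinks, jump uniformly
# at random, i.e. replace each all-zero column of M with 1/N in every row.
def fix_dead_ends(M):
    N = len(M)
    for j in range(N):
        if all(M[i][j] == 0 for i in range(N)):   # page j is a dead end
            for i in range(N):
                M[i][j] = 1.0 / N
    return M

# M'soft as a dead end (its column is all zeros):
M = fix_dead_ends([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 0.0]])
# The last column becomes [1/3, 1/3, 1/3], so every column now sums to 1
```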
Computing page rank
 Key step is matrix-vector multiplication
 rnew = Arold
 Easy if we have enough main memory to
hold A, rold, rnew
 Say N = 1 billion pages
 We need 4 bytes for each entry (say)
 2 billion entries for vectors, approx 8GB
 Matrix A has N^2 entries
 10^18 is a large number!
Rearranging the equation
r = Ar, where Aij = βMij + (1-β)/N

  ri = Σ1≤j≤N Aij rj
  ri = Σ1≤j≤N [βMij + (1-β)/N] rj
     = β Σ1≤j≤N Mij rj + (1-β)/N Σ1≤j≤N rj
     = β Σ1≤j≤N Mij rj + (1-β)/N, since |r| = 1

r = βMr + [(1-β)/N]N
where [x]N is an N-vector with all entries x
Sparse matrix formulation
 We can rearrange the page rank equation:
 r = βMr + [(1-β)/N]N
 [(1-β)/N]N is an N-vector with all entries (1-β)/N
 M is a sparse matrix!
 10 links per node, approx 10N entries
 So in each iteration, we need to:
 Compute rnew = βMrold
 Add a constant value (1-β)/N to each entry in rnew
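One iteration of this formulation touches only the nonzero entries of M. A sketch using adjacency lists (pure Python, in-memory; the link-list format is illustrative):

```python
# Iterate r_new = beta*M*r_old + (1-beta)/N using adjacency lists.
# links[j] lists the pages that page j links to (the nonzeros of column j).
beta = 0.8
links = {0: [0, 1], 1: [0, 2], 2: [1]}   # Yahoo(0), Amazon(1), M'soft(2)
N = 3

r = [1.0 / N] * N
for _ in range(100):
    r_new = [(1 - beta) / N] * N          # the constant teleport term
    for j, dests in links.items():
        share = beta * r[j] / len(dests)  # each outlink carries beta/|O(j)| of r[j]
        for i in dests:
            r_new[i] += share
    r = r_new
# With no dead ends, |r| stays 1 throughout
```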
Sparse matrix encoding
 Encode sparse matrix using only
nonzero entries
 Space proportional roughly to number of
links
 say 10N, or 4*10*1 billion = 40GB
 still won’t fit in memory, but will fit on disk
  source node | degree | destination nodes
       0      |   3    | 1, 5, 7
       1      |   5    | 17, 64, 113, 117, 245
       2      |   2    | 13, 23
Basic Algorithm
 Assume we have enough RAM to fit rnew, plus
some working memory
 Store rold and matrix M on disk

Basic Algorithm:
 Initialize: rold = [1/N]N
 Iterate:
 Update: Perform a sequential scan of M and rold to
update rnew
 Write out rnew to disk as rold for next iteration
 Every few iterations, compute |rnew-rold| and stop if it
is below threshold
 Need to read both vectors into memory
Update step

Initialize all entries of rnew to (1-β)/N
For each page p (out-degree n):
  Read into memory: p, n, dest1, …, destn, rold(p)
  for j = 1..n:
    rnew(destj) += β*rold(p)/n

  src | degree | destination nodes
   0  |   3    | 1, 5, 6
   1  |   4    | 17, 64, 113, 117
   2  |   2    | 13, 23

[figure: rnew and rold shown as arrays; each scanned row of M adds β*rold(src)/degree to the rnew entries it lists]
Analysis
 In each iteration, we have to:
 Read rold and M
 Write rnew back to disk
 IO Cost = 2|r| + |M|
 What if we had enough memory to fit
both rnew and rold?
 What if we could not even fit rnew in
memory?
 10 billion pages
Block-based update algorithm

  src | degree | destination nodes
   0  |   4    | 0, 1, 3, 5
   1  |   2    | 0, 5
   2  |   2    | 3, 4

[figure: rnew (indexed 0..5) is filled one block at a time; rold (indexed 0..5) and M are scanned once per block]
Analysis of Block Update
 Similar to nested-loop join in databases
 Break rnew into k blocks that fit in memory
 Scan M and rold once for each block
 k scans of M and rold
 k(|M| + |r|) + |r| = k|M| + (k+1)|r|
 Can we do better?
 Hint: M is much bigger than r (approx
10-20x), so we must avoid reading it k
times per iteration
Block-Stripe Update algorithm

Stripe for rnew block {0, 1}:
  src | degree | destinations in stripe
   0  |   4    | 0, 1
   1  |   3    | 0
   2  |   2    | 1

Stripe for rnew block {2, 3}:
  src | degree | destinations in stripe
   0  |   4    | 3
   2  |   2    | 3

Stripe for rnew block {4, 5}:
  src | degree | destinations in stripe
   0  |   4    | 5
   1  |   3    | 5
   2  |   2    | 4
Block-Stripe Analysis
 Break M into stripes
 Each stripe contains only destination nodes
in the corresponding block of rnew
 Some additional overhead per stripe
 But usually worth it
 Cost per iteration
 |M|(1+) + (k+1)|r|
Next
 Topic-Specific Page Rank
 Hubs and Authorities
 Spam Detection
