PageRank

Thoai Nam
High Performance Computing Lab (HPC Lab)
Faculty of Computer Science and Technology
HCMC University of Technology
PageRank  
§ Applications
§ PageRank formulation
§ Google PageRank
Graph Data: Social Networks

Facebook social graph


4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

Graph  Data:  Media  Networks  

Connections between political blogs


Polarization of the network [Adamic-Glance, 2005]

Graph Data: Information Nets

Citation networks and Maps of science


[Börner et al., 2012]

Graph Data: Communication Nets

[Figure: the Internet as a communication network, with routers connecting domain1, domain2, and domain3]
Graph Data: Technological Networks

Seven Bridges of Königsberg


[Euler, 1735]
Return to the starting point by traveling each link of the graph once
and only once.

Web as a Directed Graph
Broad Question

§ How to organize the Web?

§ First try: Human curated Web directories
o Yahoo, DMOZ, LookSmart

§ Second try: Web Search

§ Information Retrieval investigates: find relevant docs in a small and trusted set
o Newspaper articles, patents, etc.

§ But: Web is huge, full of untrusted documents, random things, web spam, etc.
Web Search: 2 Challenges

Two challenges of web search:

(1) Web contains many sources of information. Who to "trust"?
• Trick: Trustworthy pages may point to each other!

(2) What is the "best" answer to the query "newspaper"?
• No single right answer
• Trick: Pages that actually know about newspapers might all be pointing to many newspapers
Ranking Nodes on the Graph

§ Not all web pages are equally "important"
www.new-page.com vs. www.stanford.edu

§ There is large diversity in the web-graph node connectivity
Let's rank the pages by the link structure!
Link Analysis Algorithms
§ We will cover the following Link Analysis approaches for computing importance of nodes in a graph:
o PageRank
o Topic-Specific (Personalized) PageRank
o Web Spam Detection Algorithms
Links as Votes
§ Idea: Links as votes
o Page is more important if it has more links
• In-coming links? Out-going links?

§ Think of in-links as votes:
o www.stanford.edu has 23,400 in-links
o www.new-page.com has 1 in-link

§ Are all in-links equal?
o Links from important pages count more
o Recursive question!
Intuition (1)
§ Web pages are important if people visit them a lot
§ But we can't watch everybody using the Web
§ A good surrogate for visiting pages is to assume people follow links randomly
§ Leads to the random surfer model:
o Start at a random page and follow random out-links repeatedly, from whatever page you are at
o PageRank = limiting probability of being at a page
[Example: on a small graph with nodes a, b, c, d, the random walk a → b → d → a → c → d → ... visits a twice, b once, c once, and d twice]
Intuition (2)
§ Solve the recursive equation: "importance of a page = its share of the importance of each of its predecessor pages"
o Equivalent to the random-surfer definition of PageRank
§ Technically, importance = the principal eigenvector of the transition matrix of the Web
o A few fix-ups needed

NOTE: 𝒙 is an eigenvector with the corresponding eigenvalue 𝝀 if: 𝑨𝒙 = 𝝀𝒙
Example: PageRank Scores

[Figure: PageRank scores on a small example web graph: B = 38.4, C = 34.3, E = 8.1, D = 3.9, F = 3.9, A = 3.3, and several peripheral pages with score 1.6 each]
Simple Recursive Formulation
§ Each link's vote is proportional to the importance of its source page

§ If page j with importance rj has n out-links, each link gets rj / n votes

§ Page j's own importance is the sum of the votes on its in-links

[Figure: page j has in-links from page i (3 out-links, each carrying ri/3) and page k (4 out-links, each carrying rk/4), so rj = ri/3 + rk/4; j's own 3 out-links each carry rj/3]
PageRank: The “Flow” Model
§ A "vote" from an important page is worth more
§ A page is important if it is pointed to by other important pages
§ Define a "rank" rj for page j:

rj = Σi→j ri / di     (di ... out-degree of node i)

[Figure: three-node graph with pages y, a, m; y links to itself and to a, a links to y and m, m links to a; each out-link carries an equal share of its source's rank]

"Flow" equations:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2
Solving the Flow Equations
§ 3  equa2ons,  3  unknowns,  no  constants Flow  equa0ons:  
o No  unique  solu2on ry = ry /2 + ra /2
o All  solu2ons  equivalent  module  the  scale  factor ra = ry /2 + rm
§ Addi2onal  constraint  forces  uniqueness: rm = ra /2
o ry + ra + rm = 1
o Solu2on:  ry = 2/5, ra = 2/5, rm = 1/5
§ Gaussian  elimina2on  method  works  for  small  examples,  but  we  need  a  
beder  method  for  large  web-­‐size  graphs
§ We  need  a  new  formula2on!
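As a quick check (a sketch, not part of the original slides), the three flow equations plus the normalization constraint can be solved directly with numpy; the variable names are illustrative:

    import numpy as np

    # ry = ry/2 + ra/2,  ra = ry/2 + rm,  rm = ra/2,  and  ry + ra + rm = 1,
    # written as a 4x3 linear system A r = b.
    A = np.array([[ 0.5, -0.5,  0.0],
                  [-0.5,  1.0, -1.0],
                  [ 0.0, -0.5,  1.0],
                  [ 1.0,  1.0,  1.0]])
    b = np.array([0.0, 0.0, 0.0, 1.0])
    r, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(r)   # approximately [0.4, 0.4, 0.2], i.e. ry = 2/5, ra = 2/5, rm = 1/5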

PageRank: Matrix formulation
§ Stochastic adjacency matrix 𝑴
• Let page 𝑖 have 𝑑𝑖 out-links
• If 𝑖 → 𝑗, then 𝑀𝑗𝑖 = 1/𝑑𝑖, else 𝑀𝑗𝑖 = 0
o 𝑴 is a column stochastic matrix: columns sum to 1
§ Rank vector 𝒓: a vector with an entry per page
o 𝑟𝑖 is the importance score of page 𝑖, with ∑𝑖 𝑟𝑖 = 1
§ The flow equations can be written as 𝒓 = 𝑴 · 𝒓
[Figure: column 𝑖 of 𝑴 encodes the out-going links of page 𝑖, and row 𝑗 collects the in-coming links of page 𝑗, so entry 𝑗 of 𝑴 · 𝒓 sums the contributions 𝑟𝑖/𝑑𝑖 from all pages 𝑖 that link to 𝑗]
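To make the construction concrete, here is a minimal sketch (plain Python with numpy; the dict of out-links and the page names are illustrative) that builds the column-stochastic matrix 𝑴 for the y/a/m example used in the next slides:

    import numpy as np

    pages = ["y", "a", "m"]
    out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}   # page -> pages it links to

    idx = {p: k for k, p in enumerate(pages)}
    N = len(pages)
    M = np.zeros((N, N))
    for i, dests in out_links.items():
        d_i = len(dests)                    # out-degree of page i
        for j in dests:
            M[idx[j], idx[i]] = 1.0 / d_i   # column i gives 1/d_i to each destination j

    print(M)   # each column sums to 1 (column stochastic)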
Eigenvector Formulation
• The flow equations can be written as 𝒓 = 𝑴 · 𝒓
• So the rank vector 𝒓 is an eigenvector of the stochastic web matrix 𝑴
• In fact, it is its first or principal eigenvector, with corresponding eigenvalue 1
• Largest eigenvalue of 𝑴 is 1 since 𝑴 is column stochastic (with non-negative entries)
• We know 𝒓 is unit length and each column of 𝑴 sums to one, so |𝑴𝒙|1 ≤ |𝒙|1 for any 𝒙 and no eigenvalue can exceed 1
• We can now efficiently solve for 𝒓! The method is called Power iteration

NOTE: 𝒙 is an eigenvector with the corresponding eigenvalue 𝝀 if: 𝑨𝒙 = 𝝀𝒙
Example: Flow Equations & M

[Figure: three-node graph with pages y, a, m; y links to itself and to a, a links to y and m, m links to a]

M (columns indexed by the source page, rows by the destination page):

      y   a   m
  y   ½   ½   0
  a   ½   0   1
  m   0   ½   0

r = M·r, i.e.:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2
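A quick numerical check (a sketch with numpy, not part of the slides): the eigenvector of this M for eigenvalue 1, rescaled so its entries sum to 1, matches the solution ry = ra = 2/5, rm = 1/5 found earlier:

    import numpy as np

    M = np.array([[0.5, 0.5, 0.0],    # row y
                  [0.5, 0.0, 1.0],    # row a
                  [0.0, 0.5, 0.0]])   # row m

    vals, vecs = np.linalg.eig(M)
    k = np.argmax(vals.real)          # the principal eigenvalue, 1, since M is column stochastic
    r = vecs[:, k].real
    r = r / r.sum()                   # rescale so the entries sum to 1
    print(r)                          # [0.4 0.4 0.2]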

Eigenvector formulation
§ The flow equations can be written as 𝒓 = 𝑴 · 𝒓
§ So the rank vector 𝒓 is an eigenvector of the stochastic web matrix 𝑴
o Starting from any stochastic vector 𝒖, the limit 𝑴(𝑴(... 𝑴(𝑴 𝒖))) is the long-term distribution of the surfers
o The math: limiting distribution = principal eigenvector of 𝑴 = PageRank
o Note: If 𝒓 is the limit of 𝑴𝑴...𝑴𝒖, then 𝒓 satisfies 𝒓 = 𝑴 · 𝒓, so 𝒓 is an eigenvector of 𝑴 with eigenvalue 1
§ We can now efficiently solve for 𝒓! The method is called Power iteration

NOTE: 𝒙 is an eigenvector with the corresponding eigenvalue 𝝀 if: 𝑨𝒙 = 𝝀𝒙
Power iteration method
§ Given a web graph with N nodes, where the nodes are pages and edges
are hyperlinks
§ Power iteration: a simple iterative scheme
• Suppose there are N web pages
• Initialize: r(0) = [1/N, ..., 1/N]T
• Iterate: r(t+1) = M · r(t), i.e. rj(t+1) = Σi→j ri(t) / di   (di ... out-degree of node i)
• Stop when |r(t+1) – r(t)|1 < ε

About 50 iterations is sufficient to estimate the limiting solution.
|x|1 = ∑1≤i≤N |xi| is the L1 norm; any other vector norm (e.g., Euclidean) can be used.
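A minimal power-iteration sketch in Python (the dense matrix M of the y/a/m example is used for illustration; for a real web graph the sparse formulation discussed later is needed):

    import numpy as np

    def power_iteration(M, eps=1e-8, max_iter=100):
        N = M.shape[0]
        r = np.full(N, 1.0 / N)                  # r(0) = [1/N, ..., 1/N]
        for _ in range(max_iter):
            r_new = M @ r                        # r(t+1) = M · r(t)
            if np.abs(r_new - r).sum() < eps:    # stop when the L1 change is below epsilon
                return r_new
            r = r_new
        return r

    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])
    print(power_iteration(M))                    # approximately [0.4, 0.4, 0.2]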

PageRank: How to solve?

The Stationary Distribution (1)

The Stationary Distribution (2)

Existence and Uniqueness
§ A central result from the theory of random walks (a.k.a. Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0
PageRank:
The Google Formulation

PageRank: Three Questions

rj(t+1) = Σi→j ri(t) / di     or equivalently     r = M·r

Ø Does this converge?
Ø Does it converge to what we want?
Ø Are results reasonable?
Does this converge?

[Example: two pages a and b that link only to each other: a → b and b → a]

rj(t+1) = Σi→j ri(t) / di

Iterations of (ra, rb), starting from (1, 0):
ra = 1  0  1  0  ...
rb = 0  1  0  1  ...
Iteration 0, 1, 2, ...
The ranks oscillate forever and never converge.
Does it converge to what we want?

[Example: page a links to page b, and b has no out-links (a dead end)]

rj(t+1) = Σi→j ri(t) / di

Iterations of (ra, rb), starting from (1, 0):
ra = 1  0  0  0  ...
rb = 0  1  0  0  ...
Iteration 0, 1, 2, ...
All the importance "leaks out": the ranks converge to zero, which is not what we want.
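A tiny sketch (Python; the two-node dead-end graph above) shows the leak numerically: the column of the dead end is all zeros, so M is no longer column stochastic:

    import numpy as np

    # a -> b only; b is a dead end, so column b of M is all zeros
    M = np.array([[0.0, 0.0],     # row a
                  [1.0, 0.0]])    # row b
    r = np.array([1.0, 0.0])
    for t in range(4):
        print(t, r, "total importance:", r.sum())
        r = M @ r
    # totals: 1.0, 1.0, 0.0, 0.0 -- the importance leaks out at the dead end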

PageRank problems
(1) Dead ends: Some pages have no out-links
o Random walk has “nowhere” to go to
o Such pages cause importance to “leak out”
(2) Spider traps: (all out-links are within the group)
o Random walk gets “stuck” in a trap
o And eventually spider traps absorb all importance

[Figure: a small web graph containing a dead end and a spider trap]
Problem: Spider Traps

Solution: Teleports
§ The Google solution for spider traps: At each time step, the random surfer
has two options
• With prob. β, follow a link at random
• With prob. 1-β, jump to some random page
• β is typically in the range 0.8 to 0.9
§ The surfer will teleport out of a spider trap within a few time steps

Problem: Dead Ends

Solution: Always Teleport!
§ Teleports: Follow random teleport links with probability 1.0 from
dead-ends
• Adjust matrix accordingly

[Figure: the y/a/m graph where m is a dead end (left) and the adjusted graph where m teleports to every page with equal probability (right)]

Before (column m is all zeros):      After (column m replaced by 1/N):
      y   a   m                            y   a   m
  y   ½   ½   0                        y   ½   ½   ⅓
  a   ½   0   0                        a   ½   0   ⅓
  m   0   ½   0                        m   0   ½   ⅓
Why Teleports Solve the Problem?
Why are dead-ends and spider traps a problem
and why do teleports solve the problem?
§ Spider-traps are not a problem, but with traps PageRank scores
are not what we want
o Solution: Never get stuck in a spider trap by teleporting out of it in a finite
number of steps
§ Dead-ends are a problem
o The matrix is not column stochastic so our initial assumptions are not met
o Solution: Make matrix column stochastic by always teleporting when there
is nowhere else to go

Solution: Random Teleports

The Google matrix
§ PageRank equation [Brin-Page, '98]:

rj = Σi→j β · ri / di + (1 - β) · 1/N

§ The Google Matrix A:

A = β·M + (1 - β) [1/N]NxN

[1/N]NxN ... N by N matrix where all entries are 1/N

§ We have a recursive problem: 𝑟 = A · 𝑟
And the Power method still works!

§ What is β?
o In practice β = 0.8 or 0.9 (jump every 5 steps on avg.)
Random Teleports (β = 0.8)
A = 0.8·M + 0.2·[1/N]NxN

         ½  ½  0            ⅓  ⅓  ⅓          7/15  7/15   1/15
A = 0.8· ½  0  0    + 0.2 · ⅓  ⅓  ⅓    =     7/15  1/15   1/15
         0  ½  1            ⅓  ⅓  ⅓          1/15  7/15  13/15

[Figure: y/a/m graph (m links only to itself) annotated with the teleport-adjusted transition probabilities 7/15, 1/15 and 13/15]

Power iteration on A:
  y     1/3   0.33   0.24   0.26   ...    7/33
  a  =  1/3   0.20   0.20   0.18   ...    5/33
  m     1/3   0.46   0.52   0.56   ...   21/33
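The same example as a short sketch (Python with numpy; not part of the slides):

    import numpy as np

    beta = 0.8
    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.0],
                  [0.0, 0.5, 1.0]])           # m is a spider trap (links only to itself)
    N = M.shape[0]
    A = beta * M + (1 - beta) * np.full((N, N), 1.0 / N)

    r = np.full(N, 1.0 / N)
    for _ in range(100):
        r = A @ r
    print(r)    # approximately [7/33, 5/33, 21/33] = [0.212, 0.152, 0.636]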
How do we actually compute the
PageRank?

Computing Page Rank
§ Key step is matrix-vector multiplication
• rnew = A · rold
§ Easy if we have enough main memory to hold A, rold, rnew
§ Say N = 1 billion pages
• We need 4 bytes for each entry (say)
• 2 billion entries for the two vectors, approx 8 GB
• Matrix A has N² entries
• 10^18 is a large number!

A = β·M + (1-β) [1/N]NxN

         ½  ½  0            ⅓  ⅓  ⅓          7/15  7/15   1/15
A = 0.8· ½  0  0    + 0.2 · ⅓  ⅓  ⅓    =     7/15  1/15   1/15
         0  ½  1            ⅓  ⅓  ⅓          1/15  7/15  13/15
Rearranging the Equation

Since ∑i ri = 1, the Google-matrix equation r = A·r with A = β·M + (1-β)[1/N]NxN can be rearranged as r = β·M·r + [(1-β)/N]N, where [(1-β)/N]N is a vector with all N entries equal to (1-β)/N.

Sparse Matrix Formulation

PageRank: The Complete Algorithm

Sparse Matrix Encoding
§ Encode sparse matrix using only nonzero entries
• Space proportional roughly to number of links
• Say 10N, or 4*10*1 billion = 40GB
• Still won’t fit in memory, but will fit on disk

source node   degree   destination nodes
0             3        1, 5, 7
1             5        17, 64, 113, 117, 245
2             2        13, 23
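In code this encoding is simply an adjacency list with the out-degree stored alongside it; a minimal sketch (Python, illustrative data):

    # For each source node: (out-degree, list of destination nodes).
    # The degree equals len(destinations); storing it avoids recomputing it on every pass.
    links = {
        0: (3, [1, 5, 7]),
        1: (5, [17, 64, 113, 117, 245]),
        2: (2, [13, 23]),
    }
    for src, (degree, dests) in links.items():
        print(src, degree, dests)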

Basic Algorithm: Update Step
§ Assume enough RAM to fit rnew into memory
o Store rold and matrix M on disk
§ 1 step of power-iteration is:
  Initialize all entries of rnew = (1-β) / N
  For each page i (of out-degree di):
    Read into memory: i, di, dest1, ..., destdi, rold(i)
    For j = 1 ... di:
      rnew(destj) += β · rold(i) / di
[Figure: rnew (entries 0-6) held in memory, with rold and the sparse link table on disk: 0 | 3 | 1, 5, 6;  1 | 4 | 17, 64, 113, 117;  2 | 2 | 13, 23]
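A runnable version of this update step (Python; it assumes the adjacency-list encoding sketched above and keeps everything in memory for clarity, whereas the slides stream rold and M from disk):

    import numpy as np

    def update_step(links, r_old, beta=0.8):
        # One power-iteration step of r_new = beta * M * r_old + (1 - beta) / N.
        N = len(r_old)
        r_new = np.full(N, (1.0 - beta) / N)      # initialize all entries to (1-beta)/N
        for i, (d_i, dests) in links.items():     # page i with out-degree d_i
            for j in dests:
                r_new[j] += beta * r_old[i] / d_i
        return r_new

    # Tiny illustrative graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
    links = {0: (2, [1, 2]), 1: (1, [2]), 2: (1, [0])}
    r = np.full(3, 1.0 / 3)
    for _ in range(50):
        r = update_step(links, r)
    print(r)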
Analysis
§ Assume enough RAM to fit rnew into memory
• Store rold and matrix M on disk
§ In each iteration, we have to:
• Read rold and M
• Write rnew back to disk
• Cost per iteration of Power method: 2|r| + |M|
§ Question:
• What if we could not even fit rnew in memory?
Block-based Update Algorithm

[Figure: rnew split into blocks ({0,1}, {2,3}, {4,5}) that are held in memory one at a time, with rold and the sparse link table M on disk: 0 | 4 | 0, 1, 3, 5;  1 | 2 | 0, 5;  2 | 2 | 3, 4]
§ Break rnew into k blocks that fit in memory


§ Scan M and rold once for each block
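A sketch of this block-based variant (Python, illustrative; each pass over the links recomputes only the slice of rnew that currently fits in memory):

    import numpy as np

    def block_update(links, r_old, beta=0.8, block_size=2):
        # Compute r_new one block at a time, rescanning the full link table for each block.
        N = len(r_old)
        r_new = np.empty(N)
        for start in range(0, N, block_size):
            end = min(start + block_size, N)
            block = np.full(end - start, (1.0 - beta) / N)
            for i, (d_i, dests) in links.items():     # one full scan of M and r_old per block
                for j in dests:
                    if start <= j < end:              # keep only destinations in this block
                        block[j - start] += beta * r_old[i] / d_i
            r_new[start:end] = block
        return r_new

    links = {0: (4, [0, 1, 3, 5]), 1: (2, [0, 5]), 2: (2, [3, 4])}
    r = np.full(6, 1.0 / 6)
    print(block_update(links, r))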

Analysis of Block Update
§ Similar to nested-loop join in databases
• Break rnew into k blocks that fit in memory
• Scan M and rold once for each block
§ Total cost:
• k scans of M and rold
• Cost per iteration of Power method:
k(|M| + |r|) + |r| = k|M| + (k+1)|r|
§ Can we do better?
• Hint: M is much bigger than r (approx 10-20x), so we must avoid
reading it k times per iteration

Block-Stripe Update Algorithm
[Figure: M broken into one stripe per block of rnew (blocks {0,1}, {2,3}, {4,5}); each stripe lists, per source node, only the destinations that fall in that block:
  stripe for block {0,1}:  0 | 4 | 0, 1    1 | 3 | 0    2 | 2 | 1
  stripe for block {2,3}:  0 | 4 | 3       2 | 2 | 3
  stripe for block {4,5}:  0 | 4 | 5       1 | 3 | 5    2 | 2 | 4]
Break M into stripes! Each stripe contains only the destination nodes in the corresponding block of rnew.
Block-Stripe Analysis
§ Break M into stripes
• Each stripe contains only destination nodes
in the corresponding block of rnew
§ Some additional overhead per stripe
• But it is usually worth it
§ Cost per iteration of Power method: |M|(1+ε) + (k+1)|r|
Some Problems with Page Rank
§ Measures generic popularity of a page
• Biased against topic-specific authorities
• Solution: Topic-Specific PageRank (next)
§ Uses a single measure of importance
• Other models of importance
• Solution: Hubs-and-Authorities
§ Susceptible to Link spam
• Artificial link topologies created in order to boost PageRank
• Solution: TrustRank
