PageRank_2021
Thoai Nam
High Performance Computing Lab (HPC Lab)
Faculty of Computer Science and Technology
HCMC University of Technology
HPC Lab-CSE-HCMUT 1
PageRank
§ Applications
§ PageRank formulation
§ Google PageRank
Graph Data: Social Networks
Graph Data: Media Networks
Graph Data: Information Nets
Graph Data: Communication Nets
[Figure: Internet topology with a router linking domain1, domain2, and domain3]
Graph Data: Technological Networks
Web as a Directed Graph
Broad Question
§ But: Web is huge, full of untrusted documents, random things, web spam, etc.
Web Search: 2 Challenges
Ranking Nodes on the Graph
Link Analysis Algorithms
§ We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
o PageRank
o Topic-Specific (Personalized) PageRank
o Web Spam Detection Algorithms
Links as Votes
§ Idea: Links as votes
o Page is more important if it has more links
• In-coming links? Out-going links?
NOTE: 𝒙 is an eigenvector with the corresponding eigenvalue 𝝀 if: 𝑨𝒙 = 𝝀𝒙
Example: PageRank Scores
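[Figure: example graph with labeled nodes A to F and five unlabeled nodes; PageRank scores shown in the figure: 38.4, 34.3, 8.1, 3.9, 3.9, 3.3, and 1.6 for each of the five small nodes]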
Simple Recursive Formulation
§ Each link’s vote is proportional to the importance of its source page
§ If page j with importance $r_j$ has n out-links, each link gets $r_j / n$ votes
§ Page j’s own importance is the sum of the votes on its in-links
[Figure: page i (3 out-links) and page k (4 out-links) both link to page j; page j has 3 out-links, each carrying $r_j/3$]
$r_j = r_i/3 + r_k/4$
PageRank: The “Flow” Model
§ A “vote” from an important page is worth more
§ A page is important if it is pointed to by other important pages
§ Define a “rank” $r_j$ for page j:
$r_j = \sum_{i \to j} \frac{r_i}{d_i}$, where $d_i$ is the out-degree of node i
[Figure: graph on pages y, a, m with edges y→y, y→a, a→y, a→m, m→a, labeled y/2, y/2, a/2, a/2, m]
“Flow” equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
Solving the Flow Equations
Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$
§ 3 equations, 3 unknowns, no constants
o No unique solution
o All solutions are equivalent modulo the scale factor
§ An additional constraint forces uniqueness:
o $r_y + r_a + r_m = 1$
o Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$
§ Gaussian elimination works for small examples, but we need a better method for large web-size graphs
§ We need a new formulation!
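As a small illustration of the Gaussian elimination route, the sketch below solves the flow equations together with the normalization constraint using exact fractions. The helper name `solve` and the choice to drop one (redundant) flow equation are assumptions made for this example, not part of the slides.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    aug = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        aug[col] = [v / aug[col][col] for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Unknowns ordered (r_y, r_a, r_m). Two flow equations are rewritten as
# "... = 0"; the third is redundant and is replaced by the constraint.
A = [
    [F(1, 2) - 1, F(1, 2), 0],  # r_y/2 + r_a/2 - r_y = 0
    [0, F(1, 2), -1],           # r_a/2 - r_m = 0
    [1, 1, 1],                  # r_y + r_a + r_m = 1
]
b = [0, 0, 1]
r_y, r_a, r_m = solve(A, b)
print(r_y, r_a, r_m)  # 2/5 2/5 1/5
```

Elimination costs on the order of N³ operations, which is why it only suits toy graphs like this one.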
PageRank: Matrix formulation
§ Stochastic adjacency matrix 𝑴
• Let page $i$ have $d_i$ out-links
• If $i \to j$, then $M_{ji} = 1/d_i$, else $M_{ji} = 0$
o 𝑴 is a column stochastic matrix
Ø Columns sum to 1
§ Rank vector 𝒓: vector with an entry per page
o $r_i$ is the importance score of page $i$
o $\sum_i r_i = 1$
§ The flow equations can be written
$r = M \cdot r$
[Figure: column i of M holds the out-going links of page i; row j of M holds the in-coming links of page j, so that $r_j = \sum_i M_{ji} r_i$]
Example: Flow Equations & M
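[Figure: graph on pages y, a, m]
Matrix M (column = source page, row = destination page):
      y    a    m
y     ½    ½    0
a     ½    0    1
m     0    ½    0
$r = M \cdot r$
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
$\begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix}$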
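The matrix for the y/a/m example can be built mechanically from the link structure. The following is a minimal sketch; the adjacency dictionary `links` and the function name `stochastic_matrix` are hypothetical names chosen for illustration.

```python
from fractions import Fraction

# Hypothetical adjacency list for the example: page -> pages it links to.
pages = ["y", "a", "m"]
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def stochastic_matrix(pages, links):
    """Return M with M[j][i] = 1/d_i if page i links to page j, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for source, dests in links.items():
        for dest in dests:
            M[index[dest]][index[source]] = Fraction(1, len(dests))
    return M

M = stochastic_matrix(pages, links)
for name, row in zip(pages, M):
    print(name, [str(v) for v in row])

# Every column sums to 1, so M is column stochastic.
column_sums = [sum(M[j][i] for j in range(len(pages))) for i in range(len(pages))]
print(column_sums == [1, 1, 1])  # True
```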
Eigenvector formulation
§ The flow equations can be written
𝒓 = 𝑀. 𝒓
§ So the rank vector 𝒓 is an eigenvector of the stochastic web matrix 𝑀
o Starting from any stochastic vector 𝒖, the limit 𝑴(𝑴(... 𝑴(𝑴 𝒖))) is the long-term distribution of the surfers.
o The math: limiting distribution = principal eigenvector of 𝑴 = PageRank
o Note: If 𝒓 is the limit of 𝑴𝑴...𝑴𝒖, then 𝒓 satisfies the equation 𝒓 = 𝑴·𝒓, so 𝒓 is an eigenvector of 𝑴 with eigenvalue 1
Power iteration method
§ Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
§ Power iteration: a simple iterative scheme
• Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
• Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} r_i^{(t)} / d_i$, where $d_i$ is the out-degree of node i
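A minimal sketch of power iteration on the y/a/m example follows. The stopping rule (L1 change below `eps`) and the parameters `eps` and `max_iter` are assumptions added for the example; the scheme above only specifies the initialization and the update.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Repeat r <- M r from the uniform vector until r stops changing."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        # Assumed stopping rule: L1 distance between successive vectors.
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Column-stochastic matrix of the y/a/m example (order: y, a, m).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r = power_iteration(M)
print([round(v, 4) for v in r])  # [0.4, 0.4, 0.2]
```

The result agrees with the solution 2/5, 2/5, 1/5 obtained from the flow equations.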
PageRank: How to solve?
The Stationary Distribution (1)
The Stationary Distribution (2)
Existence and Uniqueness
§ A central result from the theory of random walks (a.k.a. Markov
processes):
PageRank: The Google Formulation
PageRank: Three Questions
$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, or equivalently $r = M r$
Does this converge?
$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
• Example: two pages a and b that link only to each other (a ⇄ b)
Iteration:  0  1  2  3  …
r_a      =  1  0  1  0  …
r_b      =  0  1  0  1  …
Does it converge to what we want?
$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
• Example: page a links to page b, and b has no out-links (a → b)
Iteration:  0  1  2  3  …
r_a      =  1  0  0  0  …
r_b      =  0  1  0  0  …
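Both two-page examples (the cycle from "Does this converge?" and the dead end here) can be replayed in a few lines. This is a sketch; the helper names `step` and `run` are made up for the illustration.

```python
def step(M, r):
    """One update without teleports: r_next[j] = sum_i M[j][i] * r[i]."""
    n = len(r)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]

def run(M, r, iterations=3):
    """Return the list of rank vectors for iterations 0, 1, 2, ..."""
    history = [r]
    for _ in range(iterations):
        r = step(M, r)
        history.append(r)
    return history

# Order of pages: a, b. Column = source, row = destination.
cycle = [[0, 1],
         [1, 0]]     # a -> b and b -> a
dead_end = [[0, 0],
            [1, 0]]  # a -> b, and b has no out-links

print(run(cycle, [1, 0]))     # [[1, 0], [0, 1], [1, 0], [0, 1]]
print(run(dead_end, [1, 0]))  # [[1, 0], [0, 1], [0, 0], [0, 0]]
```

The first run oscillates forever; the second converges, but to the all-zero vector, because the importance leaks out through b.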
PageRank problems
(1) Dead ends: Some pages have no out-links
o Random walk has “nowhere” to go to
o Such pages cause importance to “leak out”
(2) Spider traps: all out-links are within the group
o Random walk gets “stuck” in a trap
o And eventually spider traps absorb all importance
[Figure: example graph with a dead end and a spider trap marked]
Problem: Spider Traps
Solution: Teleports
§ The Google solution for spider traps: At each time step, the random surfer
has two options
• With prob. β, follow a link at random
• With prob. 1-β, jump to some random page
• β is typically in the range 0.8 to 0.9
§ Surfer will teleport out of spider trap within a few time steps
[Figure: example graph with a dead end and a spider trap marked]
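To see the effect of teleports on a spider trap, the sketch below uses a y/a/m graph in which m links only to itself, and compares β = 1 (no teleports) with β = 0.8. The function name `iterate` and the fixed number of steps are assumptions made for this example.

```python
def iterate(M, beta, steps=200):
    """Apply r <- beta * M r + (1 - beta) / N for a fixed number of steps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(steps):
        r = [beta * sum(M[j][i] * r[i] for i in range(n)) + (1 - beta) / n
             for j in range(n)]
    return r

# Order of pages: y, a, m. Page m links only to itself (a spider trap).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

trapped = iterate(M, beta=1.0)   # no teleports: m absorbs everything
teleport = iterate(M, beta=0.8)  # with teleports

print([round(v, 3) for v in trapped])   # [0.0, 0.0, 1.0]
print([round(v, 3) for v in teleport])  # [0.212, 0.152, 0.636]
```

With teleports the trap still scores highest, but y and a keep a nonzero share (7/33 and 5/33) instead of dropping to zero.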
Problem: Dead Ends
Solution: Always Teleport!
§ Teleports: Follow random teleport links with probability 1.0 from
dead-ends
• Adjust matrix accordingly
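[Figure: graph on pages y, a, m, where m is a dead end, before and after adding teleport links out of m]
Before (column m sums to 0):
      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    0
After (column m replaced by 1/N = ⅓):
      y    a    m
y     ½    ½    ⅓
a     ½    0    ⅓
m     0    ½    ⅓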
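The matrix adjustment can be sketched as follows: every all-zero column (a dead end) is replaced by a uniform column of 1/N. The function name `fix_dead_ends` is hypothetical.

```python
from fractions import Fraction as F

def fix_dead_ends(M):
    """Return a copy of M where each all-zero column becomes uniform 1/N."""
    n = len(M)
    fixed = [row[:] for row in M]
    for i in range(n):
        if all(M[j][i] == 0 for j in range(n)):  # page i has no out-links
            for j in range(n):
                fixed[j][i] = F(1, n)
    return fixed

# Order of pages: y, a, m. Page m is a dead end.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(0)],
     [F(0),    F(1, 2), F(0)]]

fixed = fix_dead_ends(M)
for row in fixed:
    print([str(v) for v in row])
# ['1/2', '1/2', '1/3']
# ['1/2', '0', '1/3']
# ['0', '1/2', '1/3']
```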
Why Teleports Solve the Problem?
Why are dead-ends and spider traps a problem
and why do teleports solve the problem?
§ Spider-traps are not a problem, but with traps PageRank scores
are not what we want
o Solution: Never get stuck in a spider trap by teleporting out of it in a finite
number of steps
§ Dead-ends are a problem
o The matrix is not column stochastic so our initial assumptions are not met
o Solution: Make matrix column stochastic by always teleporting when there
is nowhere else to go
Solution: Random Teleports
The Google matrix
§ PageRank equation [Brin-Page, ‘98]
Random Teleports (β = 0.8)
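$A = \beta M + (1 - \beta) [1/N]_{N \times N}$
$A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}$
[Figure: graph on pages y, a, m with teleport edges added; edge weights such as 7/15 are shown]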
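The combination A = β·M + (1-β)·[1/N] for this example can be checked with exact fractions. This is a sketch; `google_matrix` is a name chosen for the illustration.

```python
from fractions import Fraction as F

def google_matrix(M, beta):
    """Return A with A[j][i] = beta * M[j][i] + (1 - beta) / N."""
    n = len(M)
    return [[beta * M[j][i] + (1 - beta) * F(1, n) for i in range(n)]
            for j in range(n)]

# Order of pages: y, a, m. Page m links only to itself.
M = [[F(1, 2), F(1, 2), 0],
     [F(1, 2), 0,        0],
     [0,        F(1, 2), 1]]

A = google_matrix(M, F(4, 5))  # beta = 0.8
for row in A:
    print([str(v) for v in row])
# ['7/15', '7/15', '1/15']
# ['7/15', '1/15', '1/15']
# ['1/15', '7/15', '13/15']
```

Every entry of A is positive and each column still sums to 1.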
Computing PageRank
§ Key step is matrix-vector multiplication
• rnew = A · rold
§ Easy if we have enough main memory to hold A, rold, rnew
§ Say N = 1 billion pages
• We need 4 bytes for each entry (say)
• 2 billion entries for vectors, approx 8GB
• The dense matrix A has N² = 10^18 entries, far too many to store
$A = \beta M + (1 - \beta) [1/N]_{N \times N}$
$A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}$
Rearranging the Equation
Sparse Matrix Formulation
PageRank: The Complete Algorithm
Sparse Matrix Encoding
§ Encode sparse matrix using only nonzero entries
• Space proportional roughly to number of links
• Say 10N entries (about 10 links per page), or 4 bytes * 10 * 1 billion = 40GB
• Still won’t fit in memory, but will fit on disk
source node | degree | destination nodes
0           | 3      | 1, 5, 7
1           | 5      | 17, 64, 113, 117, 245
2           | 2      | 13, 23
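One way to hold this encoding in code is a list of (source, degree, destinations) records. The sketch below also redoes the space estimate; the names `sparse_M` and `estimated_bytes` and the default of 10 links per page are assumptions taken from the figures quoted above.

```python
# Each record: (source node, out-degree, destination nodes), as in the table.
sparse_M = [
    (0, 3, [1, 5, 7]),
    (1, 5, [17, 64, 113, 117, 245]),
    (2, 2, [13, 23]),
]

# The stored degree should match the number of destinations listed.
consistent = all(degree == len(dests) for _, degree, dests in sparse_M)
print(consistent)  # True

def estimated_bytes(num_pages, avg_links=10, bytes_per_entry=4):
    """Rough size of the nonzero entries only (assumed 4 bytes per entry)."""
    return num_pages * avg_links * bytes_per_entry

print(estimated_bytes(10**9) / 10**9, "GB")  # 40.0 GB
```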
Basic Algorithm: Update Step
§ Assume enough RAM to fit rnew into memory
o Store rold and matrix M on disk
§ 1 step of power-iteration is:
Initialize all entries of rnew = (1-β) / N
For each page i (of out-degree d_i):
    Read into memory: i, d_i, dest_1, …, dest_{d_i}, rold(i)
    For j = 1 … d_i:
        rnew(dest_j) += β · rold(i) / d_i
[Figure: rnew and rold are vectors indexed 0 to 6; M is stored as the records below]
source | degree | destination
0      | 3      | 1, 5, 6
1      | 4      | 17, 64, 113, 117
2      | 2      | 13, 23
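The update step translates almost line for line into Python. This is a sketch under two assumptions: the sparse records are an in-memory list standing in for the on-disk file, and the graph has no dead ends. The example graph is the small y/a/m spider-trap graph, with pages numbered 0, 1, 2.

```python
def update_step(sparse_M, r_old, beta):
    """One power-iteration step over (source, degree, destinations) records."""
    n = len(r_old)
    r_new = [(1 - beta) / n] * n           # teleport share for every page
    for i, d_i, dests in sparse_M:         # one sequential pass over M
        for dest in dests:
            r_new[dest] += beta * r_old[i] / d_i
    return r_new

# Pages 0 (y), 1 (a), 2 (m): y -> y, a; a -> y, m; m -> m.
sparse_M = [(0, 2, [0, 1]), (1, 2, [0, 2]), (2, 1, [2])]
r_old = [1 / 3] * 3

r_new = update_step(sparse_M, r_old, beta=0.8)
print([round(v, 4) for v in r_new])  # [0.3333, 0.2, 0.4667]
```

Only r_new is accessed at random positions, which is why it is the vector that has to fit in memory.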
Analysis
§ Assume enough RAM to fit rnew into memory
• Store rold and matrix M on disk
§ In each iteration, we have to:
• Read rold and M
• Write rnew back to disk
• Cost per iteration of Power method:
= 2|r| + |M|
§ Question:
• What if we could not even fit rnew in memory?
Block-based Update Algorithm
Analysis of Block Update
§ Similar to nested-loop join in databases
• Break rnew into k blocks that fit in memory
• Scan M and rold once for each block
§ Total cost:
• k scans of M and rold
• Cost per iteration of Power method:
k(|M| + |r|) + |r| = k|M| + (k+1)|r|
§ Can we do better?
• Hint: M is much bigger than r (approx 10-20x), so we must avoid
reading it k times per iteration
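The block-based scheme analyzed above can be sketched as below: rnew is built one block at a time, and M and rold are scanned in full for each block. The function name `block_update`, the in-memory lists standing in for disk files, and the scan counter are assumptions made for the illustration; the graph is assumed to have no dead ends and k is assumed to be at most the number of pages.

```python
def block_update(sparse_M, r_old, beta, k):
    """Compute r_new in k blocks, scanning all of M once per block."""
    n = len(r_old)
    block_size = -(-n // k)                # ceiling of n / k
    r_new, scans = [], 0
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        block = [(1 - beta) / n] * (end - start)  # only this block is in memory
        scans += 1                         # one full pass over M and r_old
        for i, d_i, dests in sparse_M:
            for dest in dests:
                if start <= dest < end:    # ignore destinations in other blocks
                    block[dest - start] += beta * r_old[i] / d_i
        r_new.extend(block)                # stands in for writing the block out
    return r_new, scans

# Pages 0 (y), 1 (a), 2 (m): y -> y, a; a -> y, m; m -> m.
sparse_M = [(0, 2, [0, 1]), (1, 2, [0, 2]), (2, 1, [2])]
r_old = [1 / 3] * 3

r_new, scans = block_update(sparse_M, r_old, beta=0.8, k=3)
print([round(v, 4) for v in r_new], scans)  # [0.3333, 0.2, 0.4667] 3
```

The scan count equals k, which is the k|M| term in the cost formula.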
Block-Stripe Update Algorithm
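[Figure: rnew is split into blocks {0, 1}, {2, 3}, {4, 5}; rold has entries 0 to 5; M is split into one stripe per block of rnew]
stripe (rnew block) | src | degree | destination
0, 1                | 0   | 4      | 0, 1
0, 1                | 1   | 3      | 0
0, 1                | 2   | 2      | 1
2, 3                | 0   | 4      | 3
2, 3                | 2   | 2      | 3
4, 5                | 0   | 4      | 5
4, 5                | 1   | 3      | 5
4, 5                | 2   | 2      | 4
Break M into stripes! Each stripe contains only destination nodes in the corresponding block of rnew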
Block-Stripe Analysis
§ Break M into stripes
• Each stripe contains only destination nodes
in the corresponding block of rnew
§ Some additional overhead per stripe
• But it is usually worth it
§ Cost per iteration of Power method:
=|M|(1+ε) + (k+1)|r|
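A minimal sketch of the block-stripe idea: M is partitioned once into stripes, and each block of rnew then reads only its own stripe. The names `make_stripes` and `stripe_update` and the six-node graph are hypothetical (the graph is a made-up example, not the one in the figure); k is assumed to divide the work into non-empty blocks.

```python
def make_stripes(sparse_M, n, k):
    """Split M so stripe b keeps only destinations in block b of r_new."""
    block_size = -(-n // k)                      # ceiling of n / k
    stripes = [[] for _ in range(k)]
    for i, d_i, dests in sparse_M:
        for b in range(k):
            lo, hi = b * block_size, min((b + 1) * block_size, n)
            in_block = [d for d in dests if lo <= d < hi]
            if in_block:                         # the full degree d_i is kept
                stripes[b].append((i, d_i, in_block))
    return stripes

def stripe_update(stripes, r_old, beta):
    """One power-iteration step; block b reads only stripe b."""
    n, k = len(r_old), len(stripes)
    block_size = -(-n // k)
    r_new = []
    for b, stripe in enumerate(stripes):
        lo, hi = b * block_size, min((b + 1) * block_size, n)
        block = [(1 - beta) / n] * (hi - lo)
        for i, d_i, dests in stripe:
            for dest in dests:
                block[dest - lo] += beta * r_old[i] / d_i
        r_new.extend(block)
    return r_new

# Hypothetical 6-node graph; nodes 3, 4, 5 have no out-links here.
sparse_M = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
r_old = [1 / 6] * 6
stripes = make_stripes(sparse_M, n=6, k=3)
r_new = stripe_update(stripes, r_old, beta=0.8)
print(stripes[1])  # [(0, 4, [3]), (2, 2, [3])]
```

Each source record is repeated in every stripe it touches, which is the extra ε|M| in the cost; in exchange, M is read roughly once per iteration instead of k times.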
Some Problems with PageRank
§ Measures generic popularity of a page
• Biased against topic-specific authorities
• Solution: Topic-Specific PageRank (next)
§ Uses a single measure of importance
• Other models of importance
• Solution: Hubs-and-Authorities
§ Susceptible to link spam
• Artificial link topologies created in order to boost PageRank
• Solution: TrustRank