Lecture11_PageRank_V0
Lecture 11 PageRank
HUANG Xiao
[email protected]
[Figure: course overview map. Legible labels: Big Data characteristics (volume, velocity, variety, veracity); data types (time-series, tabular, image, text, graph); large-scale data analytics systems (Hadoop, MapReduce); basic statistical analysis; supervised and unsupervised machine learning (clustering: K-means; dimensionality reduction: autoencoder, SVD); recommendation (content-based, collaborative filtering, matrix factorization (SVD), user-item interaction matrix); graph (adjacency matrix, web search, PageRank); applications: AI (ChatGPT).]
COMP4434 3
New Jersey Institute of Technology
Challenges in Web Search
Hint: Web as a Directed Graph
§ Nodes: Webpages
§ Edges: Hyperlinks
[Figure: example webpages connected by hyperlinks, with the texts "I teach Big Data in COMP", "COMP is in Faculty of Engineering", and "The Hong Kong Polytechnic University".]
Web as a Directed Graph
Ranking Nodes on the Graph
Example of Node Ranking
§ Page Ranking
§ Social Ranking
§ Paper Ranking
§ Scholar Ranking
§ ……
Idea: Links as votes
Google PageRank
Is Page == “Webpage”?
Simple Recursive Formulation
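[Figure: page $j$ receives a link from page $i$ (3 out-links) and a link from page $k$ (4 out-links).]
$$r_j = \frac{r_i}{3} + \frac{r_k}{4}$$
Each out-link of $j$ in turn carries $r_j/3$.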
How to Represent a Graph
§ Graph model $G = (V, E)$
§ $V$ is a set of pages
§ $E$ is a set of edges
§ Each edge $(u, v) \in E$ means that page $u$ points to (references) page $v$
§ Adjacency list
§ A data structure for representing a graph
§ $Adj[u] = \{v : (u, v) \in E\}$ contains every vertex $v$ adjacent to $u$
§ Example: $Adj[2] = \{3, 4\}$
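A minimal sketch of an adjacency list in Python, assuming the graph arrives as a list of directed `(u, v)` pairs. The edge list below is hypothetical; it is chosen only so that `Adj[2] = {3, 4}` as in the example above.

```python
def build_adjacency_list(edges):
    """Return {u: sorted list of v} for a directed edge list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, [])  # make sure pages with no out-links still appear
    return {u: sorted(vs) for u, vs in adj.items()}


# Hypothetical edge list (not from the slides).
edges = [(1, 2), (2, 3), (2, 4), (3, 1), (4, 3)]
adj = build_adjacency_list(edges)
print(adj[2])  # the out-neighbours of page 2
```

A dictionary of lists keeps memory proportional to the number of edges, which matters for sparse graphs such as the web.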
PageRank: The “Flow” Model
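$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
$d_i$ … out-degree of node $i$
Flow equations for the example graph (pages y, a, m):
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$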
Solving the Flow Equations
Flow equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
§ 3 equations, 3 unknowns, no constants
§ No unique solution
§ All solutions equivalent modulo the scale factor
§ Additional constraint forces uniqueness:
§ $r_y + r_a + r_m = 1$
§ Solution: $r_y = \frac{2}{5}$, $r_a = \frac{2}{5}$, $r_m = \frac{1}{5}$
§ But, we need a better method for large web-size graphs
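One way to reach this solution by hand is substitution. The steps below are a worked sketch under that approach, not a derivation taken from the slides.

```latex
\begin{align*}
r_m &= \tfrac{1}{2} r_a \\
r_a &= \tfrac{1}{2} r_y + r_m = \tfrac{1}{2} r_y + \tfrac{1}{2} r_a
      \;\Rightarrow\; r_a = r_y \\
r_y + r_a + r_m &= r_y + r_y + \tfrac{1}{2} r_y = \tfrac{5}{2} r_y = 1
      \;\Rightarrow\; r_y = \tfrac{2}{5},\quad r_a = \tfrac{2}{5},\quad r_m = \tfrac{1}{5}
\end{align*}
```

With $r_a = r_y$, the remaining equation $r_y = r_y/2 + r_a/2$ holds automatically, which is why the normalization constraint is needed to pin down the scale.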
PageRank: Matrix Formulation
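§ Stochastic adjacency matrix $M$: if page $i$ has $d_i$ out-links and $i \to j$, then $M_{ji} = 1/d_i$; otherwise $M_{ji} = 0$
§ Example: if page $i$ has 3 out-links and one of them points to $j$, then $M_{ji} = 1/3$
§ The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can then be written as $M \cdot r = r$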
Example: Flow Equations & M
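Matrix M for the example graph (column = source page, row = destination page):
      y    a    m
y     ½    ½    0
a     ½    0    1
m     0    ½    0
r = M·r:
$$\begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix}$$
Row by row, this gives the flow equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$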
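As a sketch of how M could be built from an adjacency list, the snippet below fills column `i` with `1/d_i` for each out-link of page `i`. The out-link lists are read off the columns of the example matrix; the function name `build_transition_matrix` is hypothetical.

```python
from fractions import Fraction


def build_transition_matrix(out_links, nodes):
    """Return M as a list of rows, with M[j][i] = 1/d_i if i links to j, else 0."""
    index = {node: k for k, node in enumerate(nodes)}
    n = len(nodes)
    M = [[Fraction(0)] * n for _ in range(n)]
    for i in nodes:
        targets = out_links[i]
        for j in targets:
            # Each out-link of i carries an equal share 1/d_i of i's rank.
            M[index[j]][index[i]] += Fraction(1, len(targets))
    return M


# Out-links of the example graph, inferred from the columns of M above.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = build_transition_matrix(out_links, ["y", "a", "m"])
for row in M:
    print([str(x) for x in row])
```

Exact fractions are used here only to make the result easy to compare with the slide; a large-scale implementation would use a sparse floating-point representation instead.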
Power Iteration Method
§ Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
§ Power iteration: a simple iterative scheme
§ Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
§ Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
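The scheme above can be sketched in a few lines of plain Python. The stopping rule (L1 change below `tol`) and the parameter names `tol` and `max_iter` are assumptions added for illustration; they are not stated in the text above.

```python
def power_iteration(M, tol=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below tol."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < tol:
            return r_next
        r = r_next
    return r


# Example matrix for pages y, a, m (rows = destination, columns = source).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
r = power_iteration(M)
print([round(x, 4) for x in r])  # close to [0.4, 0.4, 0.2]
```

For this small example the result agrees with the hand-solved flow equations, $r_y = r_a = 2/5$ and $r_m = 1/5$.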
PageRank: How to solve?
§ Power iteration:
§ Set $r_j = 1/N$
§ 1: $r'_j = \sum_{i \to j} \frac{r_i}{d_i}$
§ 2: $r = r'$
§ Go to 1
§ Example (pages y, a, m):
      y    a    m
y     ½    ½    0
a     ½    0    1
m     0    ½    0
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
Iteration 0, 1, 2, …
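The first few iterates for this example can be computed exactly with the node-wise update from steps 1 and 2. This is a sketch that assumes every page has at least one out-link; the out-link lists are inferred from the matrix columns, and `step` is a hypothetical helper name.

```python
from fractions import Fraction

# Out-links of the example graph, inferred from the matrix columns.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}


def step(r, out_links):
    """One update r'_j = sum over i -> j of r_i / d_i (assumes no dead ends)."""
    r_new = {page: Fraction(0) for page in r}
    for i, targets in out_links.items():
        share = r[i] / len(targets)
        for j in targets:
            r_new[j] += share
    return r_new


r = {page: Fraction(1, 3) for page in out_links}  # r_j = 1/N
history = [r]
for _ in range(3):
    r = step(r, out_links)
    history.append(r)

for t, ranks in enumerate(history):
    print(t, {page: str(value) for page, value in ranks.items()})
```

The ranks always sum to 1 here because every column of the matrix sums to 1.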
Why Power Iteration works? (1) Details!
§ Power iteration:
A method for finding the dominant eigenvector (the eigenvector corresponding to the largest eigenvalue)
§ $r^{(1)} = M \cdot r^{(0)}$
§ $r^{(2)} = M \cdot r^{(1)} = M (M r^{(0)}) = M^2 \cdot r^{(0)}$
§ $r^{(3)} = M \cdot r^{(2)} = M (M^2 r^{(0)}) = M^3 \cdot r^{(0)}$
§ Claim:
The sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \ldots, M^k \cdot r^{(0)}, \ldots$ approaches the dominant eigenvector of $M$ ($M$ is a stochastic/Markov matrix)
§ NOTE: $x$ is an eigenvector with corresponding eigenvalue $\lambda$ if:
$M x = \lambda x$
The rank vector $r$ that satisfies $r = M r$ is the first (principal) eigenvector of $M$, with corresponding eigenvalue 1
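As a quick numerical check of the claim on the y, a, m example, the sketch below verifies that $r = (2/5, 2/5, 1/5)$ satisfies $M r = r$ exactly and that $M^k r^{(0)}$ moves toward it. The helper names `matvec` and `l1_distance` are hypothetical.

```python
from fractions import Fraction as F

# Example matrix for pages y, a, m (rows = destination, columns = source).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1)],
    [F(0), F(1, 2), F(0)],
]
r_star = [F(2, 5), F(2, 5), F(1, 5)]


def matvec(M, r):
    """Multiply matrix M (list of rows) by vector r."""
    return [sum(M[j][i] * r[i] for i in range(len(r))) for j in range(len(M))]


def l1_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


is_eigenvector = matvec(M, r_star) == r_star  # eigenvalue 1

r = [F(1, 3)] * 3
distances = [l1_distance(r, r_star)]
for _ in range(20):
    r = matvec(M, r)
    distances.append(l1_distance(r, r_star))

print(is_eigenvector, float(distances[0]), float(distances[-1]))
```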
Why Power Iteration works? (2) Details!
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
or equivalently
$$r = M r$$
§ Does this converge?
Does this converge?
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
[Figure: two pages, a and b, that link to each other.]
§ Example:
Iteration   0   1   2   3   …
r_a         1   0   1   0
r_b         0   1   0   1
Does it converge to what we want?
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
[Figure: page a links to page b; b has no out-links.]
§ Example:
Iteration   0   1   2   3   …
r_a         1   0   0   0
r_b         0   1   0   0
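Both two-page examples can be reproduced with the same update rule. The sketch below assumes the graphs are a ⇄ b (for the oscillating case) and a → b with b a dead end (for the vanishing case), as the rank values suggest; `step` and `iterate` are hypothetical helper names.

```python
def step(r, out_links):
    """One update r'_j = sum over i -> j of r_i / d_i."""
    r_new = {page: 0.0 for page in r}
    for i, targets in out_links.items():
        for j in targets:
            r_new[j] += r[i] / len(targets)
    return r_new


def iterate(out_links, r0, steps):
    """Return the list of rank vectors for iterations 0..steps."""
    history = [r0]
    for _ in range(steps):
        history.append(step(history[-1], out_links))
    return history


start = {"a": 1.0, "b": 0.0}
cycle = iterate({"a": ["b"], "b": ["a"]}, start, 3)   # a and b link to each other
dead_end = iterate({"a": ["b"], "b": []}, start, 3)   # b has no out-links

print([h["a"] for h in cycle], [h["b"] for h in cycle])
print([h["a"] for h in dead_end], [h["b"] for h in dead_end])
```

In the first graph the ranks oscillate forever; in the second all rank disappears once it reaches b.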
PageRank: Problems
Problem: Spider Traps
§ Power iteration:
§ Set $r_j = 1/N$
§ $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
§ And iterate
§ Example: m is a spider trap (its only out-link points to itself)
      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    1
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2$
$r_m = r_a/2 + r_m$
Iteration   0     1     2      3       …   limit
r_y         1/3   2/6   3/12   5/24    …   0
r_a         1/3   1/6   2/12   3/24    …   0
r_m         1/3   3/6   7/12   16/24   …   1
§ The Google solution for spider traps: at each time step, the “vote” has two options
§ With probability b, follow a link at random
§ With probability 1-b, jump to some random page
§ Common values for b are in the range 0.8 to 0.9
§ The “vote” will teleport out of the spider trap within a few time steps
Problem: Dead Ends
§ Power iteration:
§ Set $r_j = 1/N$
§ $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
§ And iterate
§ Example: m is a dead end (it has no out-links)
      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    0
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2$
$r_m = r_a/2$
Iteration   0     1     2      3      …   limit
r_y         1/3   2/6   3/12   5/24   …   0
r_a         1/3   1/6   2/12   3/24   …   0
r_m         1/3   1/6   1/12   2/24   …   0
Here the PageRank “leaks” out since the matrix is not stochastic: the column for m sums to 0 instead of 1.
Solution: Always Teleport!
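[Figure: the example graph before and after adding teleport links from the dead end m.]
Before (m is a dead end):      After (m teleports to every page with probability ⅓):
      y    a    m                    y    a    m
y     ½    ½    0              y     ½    ½    ⅓
a     ½    0    0              a     ½    0    ⅓
m     0    ½    0              m     0    ½    ⅓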
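The adjustment shown above amounts to replacing every all-zero column with a uniform column of 1/N. A minimal sketch follows, with `fix_dead_ends` as a hypothetical helper name.

```python
from fractions import Fraction as F


def fix_dead_ends(M):
    """Return a copy of M in which each all-zero column is replaced by 1/N."""
    n = len(M)
    fixed = [row[:] for row in M]
    for i in range(n):
        if all(M[j][i] == 0 for j in range(n)):  # column i is a dead end
            for j in range(n):
                fixed[j][i] = F(1, n)
    return fixed


# Dead-end example for pages y, a, m (rows = destination, columns = source).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(0)],
    [F(0), F(1, 2), F(0)],
]
fixed = fix_dead_ends(M)
for row in fixed:
    print([str(x) for x in row])
```

After the fix every column sums to 1, so the matrix is stochastic again and rank no longer leaks out.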
Why Teleports Solve the Problem?
Solution: Random Teleports
§ Google’s solution that does it all:
At each step, the random surfer has two options:
§ With probability b, follow a link at random
§ With probability 1-b, jump to some random page
§ PageRank equation: $r_j = \sum_{i \to j} b \frac{r_i}{d_i} + (1 - b) \frac{1}{N}$
This formulation assumes that $M$ has no dead ends. We can either preprocess matrix $M$ to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead ends.
The Google Matrix
Random Teleports (b = 0.8)
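$$A = b \cdot M + (1 - b) \left[\tfrac{1}{N}\right]_{N \times N}$$
With b = 0.8 on the spider-trap example (pages y, a, m):
            M                     [1/N]NxN                    A
      1/2  1/2  0             1/3  1/3  1/3          7/15  7/15  1/15
0.8 · 1/2  0    0   + 0.2 ·   1/3  1/3  1/3    =     7/15  1/15  1/15
      0    1/2  1             1/3  1/3  1/3          1/15  7/15  13/15
[Figure: the graph on y, a, m with every edge labeled by its weight in A (7/15, 1/15, or 13/15).]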
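The matrix A above can be reproduced, and its stationary vector approximated, with a short sketch. The helper names `google_matrix` and `matvec` are hypothetical, and the stationary vector (7/33, 5/33, 21/33) is a value computed by this sketch rather than one quoted from the text above.

```python
from fractions import Fraction as F


def google_matrix(M, b):
    """Return A = b*M + (1-b)*[1/N], entry by entry."""
    n = len(M)
    return [[b * M[j][i] + (1 - b) / n for i in range(n)] for j in range(n)]


def matvec(A, r):
    return [sum(A[j][i] * r[i] for i in range(len(r))) for j in range(len(A))]


# Spider-trap example for pages y, a, m (rows = destination, columns = source).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(0)],
    [F(0), F(1, 2), F(1)],
]
A = google_matrix(M, F(4, 5))  # b = 0.8

r = [F(1, 3)] * 3
for _ in range(50):
    r = matvec(A, r)

stationary = [F(7, 33), F(5, 33), F(21, 33)]
print([str(x) for x in A[0]])
print([round(float(x), 4) for x in r])
print(matvec(A, stationary) == stationary)  # exact fixed-point check
```

With teleports, page m still ranks highest but no longer absorbs all of the rank.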
Reduce(key, values) {
  // key: a page
  // values: a list of PageRank contributions from all its incoming pages
  PR(key) = 1 - b;                       // teleport term
  for each pagerank in values
    PR(key) = PR(key) + b * pagerank;    // contribution received over one in-link
  emit(key, PR(key));
}
MapReduce Program for PageRank
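Links.txt (each line: a page followed by its out-links):
A B C D
B A D
C A B
D B C
Initial PR: A 1, B 1, C 1, D 1
Map output (destination page, contribution):
from A: B 1/3, C 1/3, D 1/3
from B: A 1/2, D 1/2
from C: A 1/2, B 1/2
from D: B 1/2, C 1/2
Reduce input (contributions grouped by page):
A: 1/2, 1/2
B: 1/3, 1/2, 1/2
C: 1/3, 1/2
D: 1/3, 1/2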
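One round of this MapReduce job can be simulated in plain Python. This is a single-process sketch, not Hadoop code: `map_page`, `reduce_page`, and `pagerank_round` are hypothetical names, b = 0.8 is an assumed value, and every page is assumed to have at least one out-link. A real job would also have to pass the link structure along to the next round.

```python
from collections import defaultdict

B = 0.8  # assumed damping value


def map_page(page, rank, out_links):
    """Map: send an equal share of this page's rank to each page it links to."""
    share = rank / len(out_links)
    return [(target, share) for target in out_links]


def reduce_page(page, values, b=B):
    """Reduce: PR = (1 - b) + b * (sum of incoming contributions)."""
    pr = 1 - b
    for pagerank in values:
        pr += b * pagerank
    return page, pr


def pagerank_round(links, ranks, b=B):
    """Run map on every page, group by destination (shuffle), then reduce."""
    grouped = defaultdict(list)
    for page, out_links in links.items():
        for target, share in map_page(page, ranks[page], out_links):
            grouped[target].append(share)
    return dict(reduce_page(page, grouped[page], b) for page in links)


links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A", "B"], "D": ["B", "C"]}
ranks = {page: 1.0 for page in links}  # initial PR of 1 per page
new_ranks = pagerank_round(links, ranks)
print({page: round(value, 4) for page, value in new_ranks.items()})
```

Because the teleport term is 1 - b rather than (1 - b)/N, the ranks in this variant sum to N (here 4) instead of 1.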
Web Search Engines
§ Indexer
§ Processes the retrieved pages/documents and represents them in efficient search data structures (inverted files)
§ Query Server
§ Accepts the query from the user and returns the result pages by consulting the search data structure
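To make the indexer and query server roles concrete, here is a minimal sketch of an inverted file and a conjunctive (AND) lookup. The tokenization (lowercase, split on whitespace), the function names, and the reuse of the three example page texts as documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Indexer: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def query(index, q):
    """Query server: return ids of documents containing every query term."""
    terms = q.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "I teach Big Data in COMP",
    2: "COMP is in Faculty of Engineering",
    3: "The Hong Kong Polytechnic University",
}
index = build_inverted_index(docs)
print(sorted(query(index, "comp")))
print(sorted(query(index, "big data")))
```

A search engine would then order the matching pages, for example by their PageRank scores.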