Lecture11_PageRank_V0

The document discusses the PageRank algorithm, which is used to rank web pages based on their importance determined by the link structure of the web. It explains how PageRank treats links as votes, where incoming links from important pages contribute more to a page's rank. The document also outlines the mathematical formulation of PageRank, including the use of stochastic matrices and the power iteration method for calculating the rank of web pages.

Uploaded by

taiiq zhou

COMP4434 Big Data Analytics

Lecture 11 PageRank
HUANG Xiao

[Figure: course overview map. Recoverable labels: Big Data Characteristics (Volume, Velocity, Variety, Veracity); Large-scale data analytics systems; Analytics of big data; Basic statistical analysis; Machine learning; Dimensionality reduction (autoencoder, SVD); Clustering: K-means; Graph analytics with big data; Applications: AI, ChatGPT, AlphaGo]

New Jersey Institute of Technology


PageRank Motivation: How to organize the Web?

§ First try: Human-curated Web directories
§ Yahoo, DMOZ, LookSmart
§ Second try: Web search
§ Information Retrieval investigates: finding relevant docs in a small and trusted set
§ Newspaper articles, patents, etc.
§ But: the Web is huge and full of untrusted documents, random things, web spam, etc.

Challenges in Web Search

§ (1) The Web contains many sources of information. Who to “trust”?
§ Trick: Trustworthy pages may point to each other!
§ (2) What is the “best” answer to the query “newspaper”?
§ No single right answer
§ Trick: Pages that actually know about newspapers might all be pointing to many newspapers

Hint: Web as a Directed Graph

§ Nodes: Webpages
§ Edges: Hyperlinks

[Figure: three example pages joined by hyperlinks: “I teach Big Data in COMP”, “COMP is in Faculty of Engineering”, “The Hong Kong Polytechnic University”]

Web as a Directed Graph

Ranking Nodes on the Graph

§ Not all web pages are equally “important”
§ https://xhuang31.github.io vs. https://www.polyu.edu.hk
§ There is large diversity in the web-graph node connectivity
§ Let’s rank the pages by the link structure!

Example of Node Ranking

§ Page Ranking
§ Social Ranking
§ Paper Ranking
§ Scholar Ranking
§ ……

Idea: Links as votes

§ A page is more important if it has more links
§ In-coming links? Out-going links?
§ Think of in-links as votes:
§ www.stanford.edu has 23,400 in-links
§ https://xhuang31.github.io has 0 in-links
§ Are all in-links equal?
§ Links from important pages count more
§ Recursive question!

Google PageRank

§ In-coming links! Out-going links?
§ A page has a high PageRank value if:
§ Many pages point to it, or
§ Some pages that point to it have high PageRank values themselves
§ Example:
§ Page C has a higher PageRank than Page E, even though it has fewer links to it
§ The link it has is of a much higher value

Is Page == “Webpage”?

§ Born on March 26, 1973
§ Founded Google on September 4, 1998
§ As of Nov 2024, has an estimated net worth of $163 billion (No. 15 richest)
§ PageRank began at Stanford University in 1996, where Larry Page and Sergey Brin developed it as part of a research project about a new kind of search engine
[Photo: Larry Page, co-founder of Google]

Simple Recursive Formulation

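§ Each link’s vote is proportional to the importance of its source page
§ If page j with importance PR(j) has n out-links, each link gets PR(j) / n votes
§ Page j’s own importance is the sum of the votes on its in-links
§ Example: page i has 3 out-links and page k has 4 out-links, and both link to page j, so PR(j) = PR(i)/3 + PR(k)/4
§ Page j itself has 3 out-links, each carrying PR(j)/3
[Figure: pages i and k pointing to page j, with edge labels r_i/3 and r_k/4 into j and r_j/3 on each of j's three out-links]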

How to Represent a Graph

§ Graph model $G = (V, E)$
§ $V$ is a set of pages
§ $E$ is a set of edges
§ Each edge $(u, v) \in E$ represents that page $u$ points to (references) page $v$
§ Adjacency list
§ A data structure for a graph
§ $Adj[u] = \{v : (u, v) \in E\}$ contains each vertex $v$ adjacent to $u$
§ Example: $Adj[2] = \{3, 4\}$
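
A minimal Python sketch of the adjacency-list idea above. The four-page graph is hypothetical: only `Adj[2] = {3, 4}` comes from the slide, and the helper name `in_links` is an assumption made for illustration.

```python
# Hypothetical 4-page graph stored as an adjacency list (dict of sets).
# Only adj[2] == {3, 4} is taken from the slide; the other entries are made up.
adj = {
    1: {2},
    2: {3, 4},
    3: {1},
    4: {1, 3},
}


def in_links(adj):
    """Invert the adjacency list: map each page v to the set of pages u with (u, v) in E."""
    incoming = {page: set() for page in adj}
    for u, targets in adj.items():
        for v in targets:
            incoming.setdefault(v, set()).add(u)
    return incoming


# Out-degree d_u is simply the size of Adj[u].
out_degree = {u: len(targets) for u, targets in adj.items()}
incoming = in_links(adj)

print(out_degree[2])  # out-degree of page 2
print(incoming[3])    # pages that link to page 3
```

Storing out-links per page keeps the out-degree lookup cheap, which is what the rank formulas on the following slides divide by.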

PageRank: The “Flow” Model

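§ A “vote” from an important page is worth more
§ A page is important if it is pointed to by other important pages
§ Define a “rank” $r_j$ for page $j$:
$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
where $d_i$ is the out-degree of node $i$
[Figure: three-page graph. y links to itself and to a (y/2 each); a links to y and to m (a/2 each); m links to a]
“Flow” equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$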

Solving the Flow Equations
Flow equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
§ 3 equations, 3 unknowns, no constants
§ No unique solution
§ All solutions equivalent modulo the scale factor
§ Additional constraint forces uniqueness:
§ $r_y + r_a + r_m = 1$
§ Solution: $r_y = \frac{2}{5}$, $r_a = \frac{2}{5}$, $r_m = \frac{1}{5}$
§ But we need a better method for large web-size graphs
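
The small system above can be solved exactly by elimination. The sketch below is one way to do it, assuming we drop the (redundant) third flow equation and put the constraint $r_y + r_a + r_m = 1$ in its place; the helper `solve` is a hypothetical name, written with the standard `fractions` module so the answer stays exact.

```python
from fractions import Fraction as F


def solve(A, b):
    """Gauss-Jordan elimination on exact fractions; assumes a unique solution exists."""
    n = len(A)
    rows = [[F(x) for x in A[i]] + [F(b[i])] for i in range(n)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        p = rows[col][col]
        rows[col] = [x / p for x in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [x - f * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


half = F(1, 2)
# Unknowns ordered (r_y, r_a, r_m); each flow equation is rewritten as "... = 0".
A = [
    [half - 1, half, 0],  # r_y/2 + r_a/2 - r_y = 0
    [half, -1, 1],        # r_y/2 + r_m - r_a = 0
    [1, 1, 1],            # normalization: r_y + r_a + r_m = 1
]
b = [0, 0, 1]
r_y, r_a, r_m = solve(A, b)
print(r_y, r_a, r_m)
```

Elimination costs on the order of $N^3$ operations, which is why it does not scale to web-size graphs.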

PageRank: Matrix Formulation

§ Stochastic adjacency matrix $M$
§ Let page $i$ have $d_i$ out-links
§ If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
§ $M$ is a column stochastic matrix
§ Columns sum to 1
§ Rank vector $r$: vector with an entry per page
§ $r_i$ is the importance score of page $i$
§ $\sum_i r_i = 1$
§ The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can be written $r = M \cdot r$

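
As a sketch of the definition above, the following builds $M$ from an adjacency list for the three-page y, a, m graph used in these slides. The function name `stochastic_matrix` is an assumption, and the code assumes every page has at least one out-link.

```python
from fractions import Fraction


def stochastic_matrix(adj, pages):
    """Return M as a list of rows, with M[j][i] = 1/d_i if page i links to page j, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, targets in adj.items():
        for dst in targets:
            # Column = source page, row = destination page.
            M[index[dst]][index[src]] = Fraction(1, len(targets))
    return M


pages = ["y", "a", "m"]
adj = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = stochastic_matrix(adj, pages)

# Each column should sum to 1 (column stochastic).
column_sums = [sum(M[j][i] for j in range(len(pages))) for i in range(len(pages))]
print(column_sums)
```

Note the index order: the source page selects the column, which is why the columns (not the rows) sum to 1.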
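Example

§ Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
§ Flow equation in the matrix form: $M \cdot r = r$
§ Suppose page $i$ links to 3 pages, including $j$
§ Then column $i$ of $M$ has the entry $M_{ji} = 1/3$, so row $j$ of $M \cdot r$ contributes $r_i/3$ to $r_j$
[Figure: pages i and k pointing to j (edge labels r_i/3, r_k/4), next to the product M · r = r with the entry 1/3 highlighted in row j, column i]
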
Example: Flow Equations & M

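[Figure: three-page graph where y links to y and a, a links to y and m, and m links to a]

      y    a    m
y     ½    ½    0
a     ½    0    1
m     0    ½    0

Flow equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

In matrix form, $r = M \cdot r$:
$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$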

Power Iteration Method

§ Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks
§ Power iteration: a simple iterative scheme
§ Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
§ Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
§ Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$

$d_i$ is the out-degree of node $i$

$|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm

Can use any other vector norm, e.g., Euclidean
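
The scheme above can be sketched in a few lines of plain Python. The function name `power_iteration`, the default tolerance, and the iteration cap `max_iter` are assumptions for illustration; the matrix is the y, a, m example from the earlier slides.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change drops below eps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        l1_change = sum(abs(a - b) for a, b in zip(r_next, r))
        r = r_next
        if l1_change < eps:
            break
    return r


# Column-stochastic matrix of the y, a, m graph (rows and columns ordered y, a, m).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iteration(M)
print([round(x, 4) for x in r])
```

For this graph the result should land close to the flow-equation solution $(2/5, 2/5, 1/5)$. A real implementation would use a sparse representation, since $M$ for the web is almost entirely zeros.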

PageRank: How to solve?
§ Power Iteration:
§ Set $r_j = 1/N$
§ 1: $r'_j = \sum_{i \to j} \frac{r_i}{d_i}$
§ 2: $r = r'$
§ Go to 1

[Figure: three-page graph with nodes y, a, m]

      y    a    m
y     ½    ½    0
a     ½    0    1
m     0    ½    0

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

§ Example:

Iteration    0      1      2      3       …    limit
r_y         1/3    1/3    5/12   9/24          6/15
r_a         1/3    3/6    1/3    11/24         6/15
r_m         1/3    1/6    3/12   1/6           3/15

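
The first few columns of the example can be reproduced with exact fractions. This is only a checking sketch; the helper name `step` is an assumption, and Python prints fractions in lowest terms (for instance 9/24 appears as 3/8).

```python
from fractions import Fraction as F

# Column-stochastic matrix of the y, a, m graph (rows and columns ordered y, a, m).
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]


def step(M, r):
    """One power-iteration step: r' = M r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]


r = [F(1, 3)] * 3
history = [r]
for _ in range(3):
    r = step(M, r)
    history.append(r)

for t, (r_y, r_a, r_m) in enumerate(history):
    print(t, r_y, r_a, r_m)
```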
Why Power Iteration works? (1) Details!

§ Power iteration: a method for finding the dominant eigenvector (the vector corresponding to the largest eigenvalue)
§ $r^{(1)} = M \cdot r^{(0)}$
§ $r^{(2)} = M \cdot r^{(1)} = M (M r^{(0)}) = M^2 \cdot r^{(0)}$
§ $r^{(3)} = M \cdot r^{(2)} = M (M^2 r^{(0)}) = M^3 \cdot r^{(0)}$
§ Claim: Sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \ldots, M^k \cdot r^{(0)}, \ldots$ approaches the dominant eigenvector of $M$ ($M$ is a stochastic/Markov matrix)
§ NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if $Mx = \lambda x$
§ The rank vector $r$ we want is the first (principal) eigenvector of $M$, with corresponding eigenvalue 1

Why Power Iteration works? (2) Details!

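§ Claim: Sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \ldots, M^k \cdot r^{(0)}, \ldots$ approaches the dominant eigenvector of $M$
§ NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if $Mx = \lambda x$
§ Proof:
§ Assume $M$ has $n$ linearly independent eigenvectors $x_1, x_2, \ldots, x_n$ with corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$, where $\lambda_1 > \lambda_2 > \cdots > \lambda_n$
§ Vectors $x_1, x_2, \ldots, x_n$ form a basis and thus we can write:
$$r^{(0)} = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n$$
§ Multiplying by $M$:
$$\begin{aligned} M r^{(0)} &= M (c_1 x_1 + c_2 x_2 + \cdots + c_n x_n) \\ &= c_1 (M x_1) + c_2 (M x_2) + \cdots + c_n (M x_n) \\ &= c_1 (\lambda_1 x_1) + c_2 (\lambda_2 x_2) + \cdots + c_n (\lambda_n x_n) \end{aligned}$$
§ Repeated multiplication on both sides produces
$$M^k r^{(0)} = c_1 (\lambda_1^k x_1) + c_2 (\lambda_2^k x_2) + \cdots + c_n (\lambda_n^k x_n)$$
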
Why Power Iteration works? (3) Details!

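§ Claim: Sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \ldots, M^k \cdot r^{(0)}, \ldots$ approaches the dominant eigenvector of $M$
§ Proof (continued):
§ Repeated multiplication on both sides produces
$$M^k r^{(0)} = c_1 (\lambda_1^k x_1) + c_2 (\lambda_2^k x_2) + \cdots + c_n (\lambda_n^k x_n)$$
§ Factoring out $\lambda_1^k$:
$$M^k r^{(0)} = \lambda_1^k \left[ c_1 x_1 + c_2 \left(\frac{\lambda_2}{\lambda_1}\right)^k x_2 + \cdots + c_n \left(\frac{\lambda_n}{\lambda_1}\right)^k x_n \right]$$
§ Since $\lambda_1 > \lambda_2$, the fractions $\frac{\lambda_2}{\lambda_1}, \frac{\lambda_3}{\lambda_1}, \ldots$ are less than 1 (in magnitude), and so $\left(\frac{\lambda_i}{\lambda_1}\right)^k \to 0$ as $k \to \infty$ (for all $i = 2 \ldots n$)
§ Thus: $M^k r^{(0)} \approx c_1 \lambda_1^k x_1$
§ Note: if $c_1 = 0$ then the method won’t converge
§ The largest eigenvalue of a stochastic matrix is always 1
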
PageRank: Three Questions

$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
or equivalently $r = Mr$
§ Does this converge?
§ Does it converge to what we want?
§ Are results reasonable?

Does this converge?

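$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
[Figure: two pages, a and b]
§ Example:

Iteration    0    1    2    3    …
r_a          1    0    1    0
r_b          0    1    0    1

§ The ranks swap at every step, so the iteration never settles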

Does it converge to what we want?

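$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
[Figure: two pages, a and b]
§ Example:

Iteration    0    1    2    3    …
r_a          1    0    0    0
r_b          0    1    0    0

§ All rank disappears after two steps: the iteration converges, but to the zero vector, which is not a useful ranking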
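
Both two-page examples can be replayed in a few lines. The link structures are assumptions inferred from the tables (the figures did not survive extraction): a and b linking to each other for the oscillating case, and a linking to b with b having no out-links for the vanishing case. The helper names `step` and `iterate` are hypothetical.

```python
def step(M, r):
    """One power-iteration step: r' = M r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]


def iterate(M, r, steps):
    """Return the list [r(0), r(1), ..., r(steps)]."""
    history = [r]
    for _ in range(steps):
        r = step(M, r)
        history.append(r)
    return history


# Assumed graph 1: a -> b and b -> a (rows and columns ordered a, b).
M_cycle = [[0, 1],
           [1, 0]]
# Assumed graph 2: a -> b, and b has no out-links (its column is all zeros).
M_dead = [[0, 0],
          [1, 0]]

cycle = iterate(M_cycle, [1, 0], 3)
dead = iterate(M_dead, [1, 0], 3)
print(cycle)
print(dead)
```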

PageRank: Problems

2 problems:

§ (1) Some pages are dead ends (have no out-links)
§ “Vote” has “nowhere” to go to
§ Such pages cause importance to “leak out”
§ (2) Spider traps (all out-links are within the group)
§ “Vote” gets “stuck” in a trap
§ And eventually spider traps absorb all importance
[Figure: example graph with a dead end and a spider trap marked]

Problem: Spider Traps

§ Power Iteration:
§ Set $r_j = 1/N$
§ $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
§ And iterate

[Figure: y, a, m graph in which m links only to itself; m is a spider trap]

      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    1

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2$
$r_m = r_a/2 + r_m$

§ Example:

Iteration    0      1      2      3       …    limit
r_y         1/3    2/6    3/12   5/24          0
r_a         1/3    1/6    2/12   3/24          0
r_m         1/3    3/6    7/12   16/24         1

All the PageRank score gets “trapped” in node m.
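
A short sketch that replays the spider-trap example with exact fractions. The helper name `step` and the choice of 100 iterations are assumptions for illustration.

```python
from fractions import Fraction as F

# y, a, m graph where m links only to itself (rows and columns ordered y, a, m).
M_trap = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 2), F(0),    F(0)],
          [F(0),    F(1, 2), F(1)]]


def step(M, r):
    """One power-iteration step: r' = M r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]


r = [F(1, 3)] * 3
for _ in range(3):
    r = step(M_trap, r)
after_3 = r           # matches the "Iteration 3" column of the table

for _ in range(97):
    r = step(M_trap, r)
after_100 = r         # almost all rank now sits in m

print(after_3)
print([float(x) for x in after_100])
```

The total rank stays at 1 throughout, because the matrix is still column stochastic; it is only the distribution that becomes useless.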


Solution: Teleports!

§ The Google solution for spider traps: At each time step, the “vote” has two options
§ With prob. $\beta$, follow a link at random
§ With prob. $1-\beta$, jump to some random page
§ Common values for $\beta$ are in the range 0.8 to 0.9
§ “Vote” will teleport out of the spider trap within a few time steps

[Figure: the y, a, m graph before and after adding teleport links]

Problem: Dead Ends

§ Power Iteration:
§ Set $r_j = 1/N$
§ $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
§ And iterate

[Figure: y, a, m graph in which m has no out-links; m is a dead end]

      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    0

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2$
$r_m = r_a/2$

§ Example:

Iteration    0      1      2      3       …    limit
r_y         1/3    2/6    3/12   5/24          0
r_a         1/3    1/6    2/12   3/24          0
r_m         1/3    1/6    1/12   2/24          0

Here the PageRank “leaks” out since the matrix is not stochastic.
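
The leak is easy to see by tracking the total rank at each step. This is a sketch with exact fractions; `step` and `totals` are hypothetical names.

```python
from fractions import Fraction as F

# y, a, m graph where m is a dead end: column m is all zeros, so it does not sum to 1.
M_dead = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 2), F(0),    F(0)],
          [F(0),    F(1, 2), F(0)]]


def step(M, r):
    """One power-iteration step: r' = M r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]


r = [F(1, 3)] * 3
totals = []
for _ in range(4):
    totals.append(sum(r))  # total rank before this step
    r = step(M_dead, r)

print(totals)
```

Each step throws away whatever rank currently sits on m, so the total shrinks toward 0.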
Solution: Always Teleport!

§ Teleports: Follow random teleport links with probability 1.0 from dead ends
§ Adjust matrix accordingly

[Figure: the y, a, m graph before and after adding teleport links out of the dead end m]

Before:                     After:
      y    a    m                 y    a    m
y     ½    ½    0           y     ½    ½    ⅓
a     ½    0    0           a     ½    0    ⅓
m     0    ½    0           m     0    ½    ⅓
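
The matrix adjustment above amounts to replacing every all-zero column with a uniform column. A minimal sketch, where the function name `fix_dead_ends` is an assumption:

```python
from fractions import Fraction as F


def fix_dead_ends(M):
    """Return a copy of M in which every all-zero column (a dead end) becomes uniform 1/N."""
    n = len(M)
    fixed = [row[:] for row in M]
    for i in range(n):
        if all(M[j][i] == 0 for j in range(n)):
            for j in range(n):
                fixed[j][i] = F(1, n)
    return fixed


# y, a, m graph with m as a dead end (rows and columns ordered y, a, m).
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(0)],
     [F(0),    F(1, 2), F(0)]]
M_fixed = fix_dead_ends(M)

column_sums = [sum(M_fixed[j][i] for j in range(3)) for i in range(3)]
print(M_fixed)
print(column_sums)
```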

Why Do Teleports Solve the Problem?

§ Spider traps are not a problem for the matrix formulation ($M$ stays column stochastic), but with traps the PageRank scores are not what we want
§ Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps
§ Dead ends are a problem
§ The matrix is not column stochastic, so our initial assumptions are not met
§ Solution: Make the matrix column stochastic by always teleporting when there is nowhere else to go

Solution: Random Teleports
§ Google’s solution that does it all: at each step, the random surfer has two options:
§ With probability $\beta$, follow a link at random
§ With probability $1-\beta$, jump to some random page

§ PageRank equation [Larry Page and Sergey Brin 1998]:
$$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta)\frac{1}{N}$$
where $d_i$ is the out-degree of node $i$

This formulation assumes that $M$ has no dead ends. We can either preprocess matrix $M$ to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead ends.

The Google Matrix

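§ PageRank equation [Brin-Page, ‘98]:
$$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta)\frac{1}{N}$$
§ The Google Matrix $A$:
$$A = \beta M + (1-\beta)\left[\frac{1}{N}\right]_{N \times N}$$
where $\left[\frac{1}{N}\right]_{N \times N}$ is the N by N matrix in which all entries are 1/N
§ We have a recursive problem: $r = A \cdot r$, and the power method still works!
§ What is $\beta$?
§ In practice $\beta$ = 0.8 to 0.9 (the surfer makes about 5 steps on average, then jumps)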

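Random Teleports ($\beta$ = 0.8)

$A = 0.8 \cdot M + 0.2 \cdot \left[\frac{1}{N}\right]_{N \times N}$ for the spider-trap graph (rows and columns ordered y, a, m):

            M                     [1/N]NxN                        A
      1/2  1/2   0             1/3  1/3  1/3            7/15   7/15   1/15
0.8 · 1/2   0    0    + 0.2 ·  1/3  1/3  1/3     =      7/15   1/15   1/15
       0   1/2   1             1/3  1/3  1/3            1/15   7/15   13/15

[Figure: the y, a, m graph with every edge labeled by its entry in A]

Iteration    0      1      2      3       …    limit
r_y         1/3    0.33   0.28   0.26          7/33
r_a         1/3    0.20   0.20   0.18          5/33
r_m         1/3    0.47   0.52   0.56          21/33
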
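
The example above can be checked numerically. The sketch below builds the Google matrix for $\beta = 0.8$ and runs power iteration on it; `google_matrix` and `power_iteration` are hypothetical helper names, and the tolerance is an arbitrary choice.

```python
def google_matrix(M, beta):
    """A = beta * M + (1 - beta) * [1/N], computed entry by entry."""
    n = len(M)
    return [[beta * M[j][i] + (1 - beta) / n for i in range(n)] for j in range(n)]


def power_iteration(A, eps=1e-12, max_iter=1000):
    """Iterate r <- A r from the uniform vector until the L1 change drops below eps."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(A[j][i] * r[i] for i in range(n)) for j in range(n)]
        l1_change = sum(abs(a - b) for a, b in zip(r_next, r))
        r = r_next
        if l1_change < eps:
            break
    return r


# Spider-trap graph: m links only to itself (rows and columns ordered y, a, m).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
A = google_matrix(M, beta=0.8)
r = power_iteration(A)
print([round(x, 4) for x in r])
```

With teleports, m still ends up with the largest score, but y and a keep a non-zero share instead of draining to 0.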
MapReduce Program for PageRank
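Map(key, value) {
    // key: a page
    // value: the current page rank of that page

    // Send an equal share of this page's rank to every page it links to
    for each page in Adj[key]
        emit(page, PR(key) / sizeof(Adj[key]));
}

Reduce(key, values) {
    // key: a page
    // values: the list of rank shares received from all its incoming pages

    // Teleport term. Ranks here sum to N (each page starts at 1),
    // hence 1-b rather than (1-b)/N
    PR(key) = 1 - b;
    for each pagerank in values
        PR(key) = PR(key) + b * pagerank;
    emit(key, PR(key));
}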

MapReduce Program for PageRank

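Links.txt (a page followed by the pages it links to) and initial PR:

A B C D      A 1
B A D        B 1
C A B        C 1
D B C        D 1

Map output:

from A: (B, 1/3), (C, 1/3), (D, 1/3)
from B: (A, 1/2), (D, 1/2)
from C: (A, 1/2), (B, 1/2)
from D: (B, 1/2), (C, 1/2)

Reduce input (grouped by key):

A: 1/2, 1/2
B: 1/3, 1/2, 1/2
C: 1/3, 1/2
D: 1/3, 1/2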
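
One iteration of the Map and Reduce steps on the four-page Links.txt example can be simulated in plain Python. This is a single-process sketch, not a real MapReduce job: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and b = 0.8 is an assumed value from the 0.8 to 0.9 range mentioned earlier.

```python
from collections import defaultdict
from fractions import Fraction as F


def map_phase(adj, pr):
    """Emit (target, share) pairs: each page splits its rank equally among its out-links."""
    for page, targets in adj.items():
        for target in targets:
            yield target, pr[page] / len(targets)


def shuffle(pairs):
    """Group emitted values by key, as the framework would do between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, b):
    """Combine incoming shares with the teleport term: PR = (1 - b) + b * sum(shares)."""
    return {key: (1 - b) + b * sum(values) for key, values in grouped.items()}


# Links.txt from the slide: each page followed by the pages it links to.
adj = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A", "B"], "D": ["B", "C"]}
pr = {page: F(1) for page in adj}  # initial PR of 1 per page
b = F(4, 5)                        # assumed damping value (0.8)

grouped = shuffle(map_phase(adj, pr))
new_pr = reduce_phase(grouped, b)
print(dict(grouped))
print(new_pr)
```

In this sketch a page with no in-links never reaches the reducer and would drop out of `new_pr`; a fuller version would also pass each page's adjacency list through the Map step so that the next iteration has the graph structure available.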

Web Search Engines

§ Indexer
§ Processes the retrieved pages/documents and represents them in efficient search data structures (inverted files)

§ Query Server
§ Accepts the query from the user and returns the result pages by consulting the search data structure
