4.link Analysis and Page Rank S4

1. PageRank is an algorithm used by Google to rank web pages based on the link structure of the web. It models a random web surfer who clicks links at random, and ranks each page by the probability that the surfer lands on it.
2. PageRank treats links as recommendations and values a page highly if it receives many recommendations from other important pages. It was designed to counter the "term spam" that fooled early search engines.
3. PageRank models the web as a directed graph with pages as nodes and links as edges, then determines the importance of pages by computing an eigenvector of the stochastic matrix of the web.

Uploaded by

Houssam Fouki


BIG DATA ANALYTICS

Link Analysis and PageRank

Problem: efficient Web search

• The availability of efficient and accurate Web search

• Google: the first able to defeat the spammers
• The innovation provided by Google: "PageRank"
Solutions
• PageRank: an essential technique for a search engine
• Spammers invented ways to manipulate PageRank
• ⇒ TrustRank (and other techniques) for preventing
spammers from attacking PageRank
PageRank
• Larry Page, the co-inventor of PageRank and a co-founder of Google
• PageRank: a tool for evaluating the importance of Web pages
• Ideas: "random surfers" and "taxation"
Early Search Engines

Before Google:

• Crawling the Web

• listing the terms found in each page in an inverted index
• makes it easy, given a term, to find all the places where that
term occurs
• When a search query is issued:
• pages containing those terms are extracted from the inverted index
• pages are ranked according to how the terms are used within the page:
• presence of a term in a header
• large number of occurrences of the term
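
The inverted index and term-based ranking described above can be sketched in a few lines. This is only an illustration: the toy corpus, the function names, the rule that a page must contain every query term, and the use of raw occurrence counts as the score are all assumptions, not a description of any real search engine.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index

def search(index, pages, query):
    """Return pages containing every query term, ranked by raw term count."""
    terms = query.lower().split()
    if not terms:
        return []
    # Look up each term in the index and keep pages that contain all of them.
    candidates = set.intersection(*(index.get(t, set()) for t in terms))

    def score(page_id):
        words = pages[page_id].lower().split()
        return sum(words.count(t) for t in terms)

    # Highest count first; ties broken by page id for a stable order.
    return sorted(candidates, key=lambda p: (-score(p), p))

# Hypothetical toy corpus.
pages = {
    "p1": "movie reviews and movie trailers",
    "p2": "cheap shirts for sale",
    "p3": "a movie about shirts",
}
index = build_inverted_index(pages)
print(search(index, pages, "movie"))  # ['p1', 'p3']
```

Because the score depends only on the page's own text, whoever writes the page controls its rank.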

Term Spam
• How to fool search engines?
• E.g., suppose you were selling shirts on the Web
• Add the term "movie" to your page thousands of times
• Give it the same color as the background
• When a user issued a search query with the term "movie", the
search engine would list your page first
• If simply adding "movie" to your page didn't do the trick:
• issue the query "movie" yourself
• copy the page that comes back as the first result into your own
• use the background color to make it invisible

• Term spam made early search engines almost useless...

Graph Data: Social Networks

Facebook social graph, [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
Graph Data: Media Networks

Connections between political blogs, [Adamic-Glance, 2005]

Graph Data: Information Nets

Citation networks and Maps of science, [Börner et al., 2012]

Web as a Graph

• Web is a directed graph

• Nodes: Webpages
• Edges: Hyperlinks
Web search: challenges

• The Web contains many sources of information. Who to trust?
• Trick: trustworthy pages may point to each other!
• What is the best answer to the query "newspaper"?
• No single right answer
• Trick: pages that actually know about newspapers might all be
pointing to many newspapers
Ranking Nodes on the Graph
• Not all web pages are equally "important"
• There is large diversity in the web-graph node connectivity
• Let’s rank the pages by the link structure!
Links as Votes

• Idea: links are votes

• A page is more important if it has more incoming links
• Are all in-links equal?
• Links from important pages count more
PageRank Scores
Recursive Formulation

• Each link's vote is proportional to the importance of its source page
• If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
• Page $j$'s own importance is the sum of the votes on its in-links
The "flow" model

• A "vote" from an important page is worth more
• A page is important if it is pointed to by other important
pages
• The rank $r_j$ of page $j$:
$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
where $d_i$ is the out-degree of node $i$
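
To make the flow equations concrete, the sketch below applies them once to a candidate rank vector and checks that the vector is reproduced. The three-page graph (y, a, m), the dictionary layout, and the function name `flow_update` are assumptions chosen for illustration.

```python
from fractions import Fraction

# Hypothetical three-page graph: page -> list of pages it links to.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def flow_update(links, r):
    """Return the vector whose j-th entry is the sum over i -> j of r[i] / d_i."""
    new_r = {page: Fraction(0) for page in links}
    for i, targets in links.items():
        share = r[i] / len(targets)  # each out-link of i carries r_i / d_i
        for j in targets:
            new_r[j] += share
    return new_r

# Candidate ranks; exact fractions avoid floating-point noise in the comparison.
r = {"y": Fraction(2, 5), "a": Fraction(2, 5), "m": Fraction(1, 5)}
print(flow_update(links, r) == r)  # True
```

A vector that comes back unchanged satisfies every flow equation at once.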
Solving the flow equations

$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$

• No unique solution: any scalar multiple of a solution is also a solution
• An additional constraint forces uniqueness:
$$\sum_i r_i = 1$$
• Gaussian elimination method works for small examples
• A better method for large web-size graphs?
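
For a small example, the flow equations plus the normalization constraint can be solved directly by elimination. The sketch below assumes a hypothetical three-page graph with y -> {y, a}, a -> {y, m}, m -> {a}. Because one flow equation is redundant, it is replaced by the constraint that the ranks sum to 1. The solver is a plain Gauss-Jordan routine written for illustration and assumes the system is non-singular.

```python
from fractions import Fraction

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions.

    Assumes A is square and non-singular (otherwise the pivot search fails).
    """
    n = len(A)
    # Augmented matrix [A | b].
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(k for k in range(col, n) if aug[k][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for k in range(n):
            if k != col and aug[k][col] != 0:
                f = aug[k][col]
                aug[k] = [v - f * w for v, w in zip(aug[k], aug[col])]
    return [row[-1] for row in aug]

# Unknowns ordered (r_y, r_a, r_m); each flow equation is moved to the form "... = 0".
half = Fraction(1, 2)
A = [
    [half - 1, half, 0],  # r_y = r_y/2 + r_a/2
    [half, -1, 1],        # r_a = r_y/2 + r_m
    [1, 1, 1],            # r_y + r_a + r_m = 1 (replaces the redundant equation for r_m)
]
b = [0, 0, 1]
print(solve_linear(A, b))  # [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
```

Elimination costs on the order of n^3 operations for n pages, which is why it is only practical for small examples.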
Matrix Formulation
• Stochastic adjacency matrix $M$:
• Let page $i$ have $d_i$ out-links
• If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
• $M$ is a column stochastic matrix
• Columns sum to 1
• Rank vector $r$: a vector with one entry per page
• $r_i$ is the importance score of page $i$
• $\sum_i r_i = 1$
• The flow equation:
$$r = M \cdot r$$
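
A minimal sketch of building $M$ from an adjacency list, assuming a hypothetical three-page graph in which every page has at least one out-link and no link is listed twice. The helper names `column_stochastic` and `mat_vec` are illustrative.

```python
from fractions import Fraction

def column_stochastic(links, order):
    """Build M with M[j][i] = 1/d_i if page i links to page j, else 0."""
    pos = {page: k for k, page in enumerate(order)}
    n = len(order)
    M = [[Fraction(0)] * n for _ in range(n)]
    for i, targets in links.items():
        for j in targets:
            M[pos[j]][pos[i]] = Fraction(1, len(targets))
    return M

def mat_vec(M, r):
    """Plain matrix-vector product M * r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
order = ["y", "a", "m"]
M = column_stochastic(links, order)

# Every column should sum to 1 because each page splits its vote among its out-links.
column_sums = [sum(M[j][i] for j in range(3)) for i in range(3)]
print(column_sums)

# The vector (2/5, 2/5, 1/5) satisfies the flow equation r = M r for this graph.
r = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
print(mat_vec(M, r) == r)  # True
```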
Eigenvector Formulation

• The flow equation:

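$$r = M \cdot r$$
• The rank vector $r$ is an eigenvector of the stochastic matrix $M$, with eigenvalue 1
• We can now solve it efficiently!
• Power iteration: start from any distribution and repeatedly apply $r \leftarrow M \cdot r$ until it stops changing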
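
Power iteration can be sketched as follows. The matrix is the column stochastic matrix of a hypothetical three-page graph (y -> {y, a}, a -> {y, m}, m -> {a}). The uniform starting vector, the L1 stopping test, and the parameter names `tol` and `max_iter` are assumptions made for this illustration.

```python
def power_iteration(M, tol=1e-10, max_iter=1000):
    """Repeatedly apply r <- M r, starting from the uniform distribution."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        new_r = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        # Stop once the vector changes by less than tol in L1 norm.
        if sum(abs(a - b) for a, b in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r

# Rows and columns ordered (y, a, m); each column sums to 1.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
r = power_iteration(M)
print([round(x, 4) for x in r])  # [0.4, 0.4, 0.2]
```

Each iteration costs one pass over the links, which is what makes the method workable on large sparse graphs where elimination is not.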
Random Walk Interpretation

• Imagine a random web surfer:

• At any time $t$, the surfer is on some page $i$
• At time $t + 1$, the surfer follows an out-link from $i$ uniformly at
random
• The surfer ends up on some page $j$ linked from $i$
• The process repeats indefinitely
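
The random surfer can also be simulated directly. The sketch below counts how often a single surfer visits each page of a hypothetical three-page graph. The step count, the fixed seed, and the function name `simulate_surfer` are assumptions, and the visit frequencies are only estimates that fluctuate with the seed.

```python
import random
from collections import Counter

def simulate_surfer(links, steps, start, seed=0):
    """Follow random out-links for `steps` steps and return visit frequencies."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = Counter()
    page = start
    for _ in range(steps):
        page = rng.choice(links[page])  # pick an out-link uniformly at random
        visits[page] += 1
    return {p: visits[p] / steps for p in links}

links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(links, steps=100_000, start="y")
print(freq)  # roughly 0.4 for y, 0.4 for a, 0.2 for m
```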
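The Stationary Distribution

• Where is the surfer at time $t + 1$?
• Let $p(t)$ be the vector whose $i$-th entry is the probability that the surfer is on page $i$ at time $t$
• The surfer follows a link uniformly at random:
$$p(t + 1) = M \cdot p(t)$$
• Suppose the random walk reaches a state
$$p(t + 1) = M \cdot p(t) = p(t)$$
then $p(t)$ is a stationary distribution of the random walk
• Our original rank vector $r$ satisfies $r = M \cdot r$
• So $r$ is a stationary distribution for the random walk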
PageRanking: three questions

r = Mr

• Does this converge?

• Does it converge to what we want?
• Are results reasonable?
Existence and Uniqueness

For graphs that satisfy certain conditions, the stationary
distribution is unique and will eventually be reached no
matter what the initial probability distribution at time $t = 0$ is
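
To see why conditions on the graph matter, consider a hypothetical two-page cycle in which a links only to b and b links only to a. The sketch below shows that a surfer who starts on a with certainty never settles down, even though the uniform distribution is stationary. The example graph and the helper name `step` are assumptions for illustration.

```python
def step(M, p):
    """One step of the walk: p(t + 1) = M p(t)."""
    n = len(M)
    return [sum(M[j][i] * p[i] for i in range(n)) for j in range(n)]

# Hypothetical two-page cycle: a -> b and b -> a (rows and columns ordered a, b).
M = [
    [0.0, 1.0],
    [1.0, 0.0],
]

p = [1.0, 0.0]  # the surfer starts on page a with certainty
history = [p]
for _ in range(4):
    p = step(M, p)
    history.append(p)
print(history)  # the distribution flips between [1, 0] and [0, 1] forever

# The uniform distribution is stationary, but it is not reached from the start above.
print(step(M, [0.5, 0.5]))  # [0.5, 0.5]
```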
PageRank

PageRank simulates where Web surfers, starting at a random
page, would tend to congregate if they followed randomly
chosen out-links

• Pages with a large number of surfers are considered more "important"
• Google prefers important pages to unimportant pages
Simplified PageRank?

• Computing PageRank by simulating random surfers is a time-consuming process...
• Why not simply count the number of in-links for each page?
• A spammer could create a "spam farm" of a million pages, each of which links to his shirt page
• ⇒ the shirt page looks very important...
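
The weakness of plain in-link counting is easy to reproduce. The sketch below builds a hypothetical four-page web, adds a spam farm, and counts in-links. The page names are invented, and the farm has 1,000 pages rather than a million only to keep the example fast.

```python
from collections import Counter

def in_link_counts(links):
    """Count, for every page, how many distinct pages link to it."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):  # a page's repeated links count once
            counts[target] += 1
    return counts

# Hypothetical small web: page -> list of pages it links to.
links = {
    "news": ["wiki"],
    "wiki": ["news"],
    "blog": ["news", "wiki"],
    "shirts": [],
}

# The spam farm: many throwaway pages that all link to the shirt page.
for k in range(1000):
    links[f"farm{k}"] = ["shirts"]

counts = in_link_counts(links)
print(counts.most_common(1))  # [('shirts', 1000)]
```

Every farm page counts as a full vote here, no matter how unimportant it is.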
Why does it work?

• Hard to fool Google
• E.g., the shirt-seller can still add "movie" to his page
• But Google believes what other pages say about him, not what he says about himself
• What if he creates many pages of his own and links them to his shirt-selling page?
• Those pages would not be given much importance by
PageRank...
Structure of the Web

”Analysis of the Greek Web-space”[T. Mchedlidze et al, ]


PageRank: problems
• (1) Dead ends (pages with no out-links):
• The random walk has "nowhere" to go
• Such pages cause importance to "leak out"
• (2) Spider traps (all out-links are within the group):
• The random walker gets "stuck" in a trap
• Spider traps absorb all importance...
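
Both problems show up after a few dozen iterations of $r \leftarrow M \cdot r$ on small examples. The two matrices below are hypothetical variants of a three-page graph (ordered y, a, m): in the first, m has no out-links; in the second, m links only to itself. The iteration count of 100 is an arbitrary choice.

```python
def iterate(M, steps):
    """Apply r <- M r `steps` times, starting from the uniform distribution."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(steps):
        r = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
    return r

# Dead end: y -> {y, a}, a -> {y, m}, m has no out-links (its column is all zeros).
dead_end = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]

# Spider trap: same graph, except that m links only to itself.
spider_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]

leaked = iterate(dead_end, 100)
trapped = iterate(spider_trap, 100)
print(sum(leaked))  # close to 0: the importance has leaked out
print(trapped)      # close to [0, 0, 1]: the trap has absorbed the importance
```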

J. Leskovec, et al: Mining of Massive Datasets.

Solution: teleports!

• The Google solution for spider traps: at each time step, the
random surfer has two options:
1. with probability $\beta$, follow a link at random
2. with probability $1 - \beta$, jump to some random page
• Common values for $\beta$ are in the range 0.8 to 0.9
• The surfer will teleport out of a spider trap or a dead end
within a few time steps
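
With teleports, the update for a graph of $N$ pages is commonly written as $r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + \frac{1 - \beta}{N}$. The sketch below is one possible implementation under stated assumptions: every page appears as a key of `links`, and any probability mass not passed along links (the teleport share plus whatever a dead end would lose) is spread uniformly over all pages. The function name, the tolerance, and the example graph with a spider trap at m are illustrative.

```python
def pagerank(links, beta=0.85, tol=1e-12, max_iter=1000):
    """PageRank with teleports on an adjacency list {page: [targets]}."""
    pages = list(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new_r = {p: 0.0 for p in pages}
        # With probability beta, the surfer follows one of the out-links.
        for i, targets in links.items():
            if targets:
                share = beta * r[i] / len(targets)
                for j in targets:
                    new_r[j] += share
        # Assumption: all remaining mass (teleports and dead ends) is spread uniformly.
        leaked = 1.0 - sum(new_r.values())
        for p in pages:
            new_r[p] += leaked / n
        if sum(abs(new_r[p] - r[p]) for p in pages) < tol:
            return new_r
        r = new_r
    return r

# Hypothetical graph with a spider trap: m links only to itself.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
r = pagerank(links, beta=0.8)
print({p: round(v, 4) for p, v in r.items()})  # {'y': 0.2121, 'a': 0.1515, 'm': 0.6364}
```

The trap still collects the largest share, but y and a keep non-zero ranks instead of being drained to zero.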
Using PageRank in a Search Engine

• A secret formula decides the order in which to show pages to the user
• Google: over 250 different properties of pages
• A page has to have at least one of the search terms in the
query
• Normally, unless all the search terms are present, a page has
very little chance of being in the top ten
• A score is computed for each qualified page
• An important component: the PageRank of the page
• Other components: the presence or absence of search terms in
prominent places
• ...
References

• J. Leskovec, A. Rajaraman and J. D. Ullman, Mining of Massive Datasets (2014), Chapter 5
• S. Brin and L. Page, "Anatomy of a large-scale hypertextual web search engine", Proc. 7th Intl. World-Wide-Web Conference, pp. 107-117, 1998.
