Link Structure Analysis
Kira Radinsky
All of the following slides are courtesy of Ronny Lempel (Yahoo!)
29 November 2010 236620 Search Engine Technology 2
Link Analysis
In the Lecture
• HITS: topic-specific algorithm
– Assigns each page two scores – a hub score and an authority score –
with respect to a topic
• PageRank: query independent algorithm
– Assigns each page a single, global importance score
• Both algorithms reduce to computing the principal eigenvectors
of certain matrices
Today’s Tutorial:
1. Graph modifications in link analysis algorithms
2. SALSA – HITS with a random-walk twist
3. Topic-Sensitive PageRank
Graph Modifications in Link-Analysis
Algorithms
1. Delete irrelevant elements (pages, links) from the collection.
– Non-informative links
– Pages that are deemed irrelevant (mostly by similarity of
content to the query), and their incident links [Bharat
and Henzinger, 1998]
2. Assign varying (positive) link weights to the non-deleted
links.
– Similarity of anchor text to the query [CLEVER]
– Links incident to pre-defined relevant pages [CLEVER]
– Multiple links from pages of site A to pages of site B
[Bharat and Henzinger, 1998]
• Note that some of the above modifications are only
applicable to topic distillation algorithms
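A minimal sketch of the last weighting idea, assuming the rule from Bharat and Henzinger in which k links from pages on one host to the same target page each receive weight 1/k, so that a single site cannot inflate a page's authority. The function name and the example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def authority_link_weights(links):
    """Map each (source, target) link to 1/k, where k is the number of
    links from the source's host to that same target page."""
    def host(url):
        return urlparse(url).netloc

    per_host = Counter((host(src), dst) for src, dst in links)
    return {(src, dst): 1.0 / per_host[(host(src), dst)] for src, dst in links}


links = [
    ("http://a.example/p1", "http://b.example/x"),
    ("http://a.example/p2", "http://b.example/x"),  # same host as p1
    ("http://c.example/q", "http://b.example/x"),
]
weights = authority_link_weights(links)
```

The two links from a.example share one unit of weight, while the single link from c.example keeps a full unit.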
SALSA – Stochastic Approach to Link
Structure Analysis
• SALSA, like HITS, is a topic-distillation algorithm that aims to
assign pages both hub and authority scores
– SALSA analyzes the same topic-centric graph as HITS, but splits each
node into two – a “hub personality” without in-links and an
“authority personality” without out-links
– Examines the resulting bipartite graph
• Innovation: incorporate stochastic analysis with the
authority-hub paradigm
– Examine two separate random walk Markov chains:
an authority chain A, and a hub chain H.
– A single step in each chain is composed of two link traversals on the
Web - one link forward, and one link backwards.
– The principal community of each type: the most frequently visited
pages in the corresponding Markov Chain
Forming the bipartite graph in SALSA
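SALSA – Authority Chain Example
• Example from the slide's figure (not reproduced): $\Pr(2 \to 3) = \frac{2}{5} \cdot \frac{1}{3}$
• Formally, the transition probability matrix:
$[P_A]_{i,j} = \sum_{\{k \,\mid\, k \to i,\ k \to j\}} d_{in}(i)^{-1} \, d_{out}(k)^{-1}$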
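The authority chain's transition matrix can be sketched as the product of a backward step (from authority i to one of the hubs linking to it, chosen uniformly) and a forward step (from that hub to one of the pages it links to, chosen uniformly). This is a minimal sketch for unweighted links; the function name, the integer node ids and the tiny example graph are assumptions, not the slide's figure.

```python
import numpy as np


def authority_transition_matrix(edges, n):
    """Return P_A for an unweighted directed graph on nodes 0..n-1.

    edges holds (k, j) pairs meaning "k links to j". Rows of pages with no
    in-links stay all zero, since those pages are not authorities.
    """
    adj = np.zeros((n, n))
    for k, j in edges:
        adj[k, j] = 1.0
    in_deg = adj.sum(axis=0)[:, None]
    out_deg = adj.sum(axis=1)[:, None]
    # back[i, k]: probability of moving from authority i back to hub k.
    back = np.divide(adj.T, in_deg, out=np.zeros_like(adj), where=in_deg > 0)
    # fwd[k, j]: probability of moving from hub k forward to authority j.
    fwd = np.divide(adj, out_deg, out=np.zeros_like(adj), where=out_deg > 0)
    return back @ fwd


# Hubs 0 and 1, authorities 2 and 3.
P_A = authority_transition_matrix([(0, 2), (0, 3), (1, 3)], n=4)
```

Every row that belongs to an authority sums to 1, as a transition matrix row should.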
SALSA: Analysis
• The transition probabilities induce a probability
distribution on the authorities (hubs) in the authority
(hub) Markov chain
– If the chains are not irreducible, the probability depends on the
initial distribution (chosen to be uniform)
• The principal community of authorities (hubs) is defined
as the k most probable pages in the authority (hub) chain
• While one can compute the scores by calculating the
principal eigenvector of the stochastic transition matrices,
a more efficient way exists
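For reference, the direct route mentioned in the last bullet can be sketched as power iteration from a uniform initial distribution. The function name, the iteration count and the 2x2 example matrix are assumptions chosen for illustration.

```python
import numpy as np


def stationary_by_power_iteration(P, iters=200):
    """Repeatedly apply a row-stochastic matrix P to a uniform start vector."""
    P = np.asarray(P, dtype=float)
    dist = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        dist = dist @ P
    return dist


# Authority chain of a graph whose two authorities have in-degrees 1 and 2.
P_A = [[0.5, 0.5],
       [0.25, 0.75]]
pi = stationary_by_power_iteration(P_A)
```

In this small example the result is proportional to the in-degrees (1/3 and 2/3), which is the pattern the shortcut on the next slide exploits.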
SALSA: Analysis (cont.)
Mathematical analysis of SALSA leads to the following theorem: SALSA's
authority weights reflect the normalized in-degree of each page,
multiplied by the relative size of the page's component in the
authority side of the graph.
For page x in the slide's example (figure not reproduced):
$a(x) = \frac{3}{3+5} \times \frac{4}{4+2} = 0.25$
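The theorem suggests computing authority scores without any eigenvector computation: find the connected components of the bipartite graph, then combine in-degrees with component sizes. Below is a minimal sketch for unweighted links. The function name and the example graph are hypothetical; the graph is only constructed to match the numbers in the example (x has in-degree 3, its component has total in-degree 8 and holds 4 of the 6 authorities).

```python
from collections import defaultdict


def salsa_authority_scores(edges):
    """edges: list of (hub, authority) pairs, each link with weight 1."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    in_deg = defaultdict(int)
    for k, j in edges:
        in_deg[j] += 1
        # Hub and authority personalities are distinct nodes, even when a
        # page plays both roles.
        root_hub, root_auth = find(("h", k)), find(("a", j))
        parent[root_hub] = root_auth

    members_of = defaultdict(list)
    for j in in_deg:
        members_of[find(("a", j))].append(j)

    total_authorities = len(in_deg)
    scores = {}
    for members in members_of.values():
        component_in_deg = sum(in_deg[j] for j in members)
        for j in members:
            relative_size = len(members) / total_authorities
            scores[j] = relative_size * in_deg[j] / component_in_deg
    return scores


edges = [
    ("h0", "x"), ("h1", "x"), ("h2", "x"),
    ("h0", "a1"), ("h0", "a2"), ("h1", "a2"), ("h1", "a3"), ("h2", "a3"),
    ("h3", "b1"), ("h3", "b2"),  # second, smaller component
]
scores = salsa_authority_scores(edges)
```

The scores sum to 1 over all authorities, with each component receiving a share equal to its fraction of the authorities.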
SALSA: Proof for Irreducible Authority Chains
• The proof assumes a weighted graph, in which the link $k \to j$
has weight $w(k \to j)$
– The examples shown so far assumed that all links have a weight of 1
• Define $W$ as the sum of all link weights
• Define a distribution vector $\pi$ by $\pi_j = d_{in}(j)/W$, where $d_{in}(j)$
is the sum of weights of j's incoming links
– Similarly, $d_{out}(k)$ is the sum of weights of k's outgoing links
• It is enough to prove that $\pi P_A = \pi$, since an irreducible $P_A$ has a
unique stationary distribution (Ergodic Theorem)
– Recall that $P_A$ is the transition matrix of the authority chain
– $P_A$ is always aperiodic
SALSA: Proof for Irreducible Authority Chains
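The body of this slide did not survive extraction. What follows is a sketch of the standard calculation, not a recovery of the slide, assuming the weighted transition matrix $[P_A]_{i,j} = \sum_{k:\, k \to i,\ k \to j} \frac{w(k \to i)}{d_{in}(i)} \cdot \frac{w(k \to j)}{d_{out}(k)}$, which reduces to the unweighted definition when all weights are 1.

```latex
\begin{align*}
[\pi P_A]_j
  &= \sum_i \pi_i \, [P_A]_{i,j}
   = \sum_i \frac{d_{in}(i)}{W}
     \sum_{k:\, k \to i,\ k \to j}
     \frac{w(k \to i)}{d_{in}(i)} \cdot \frac{w(k \to j)}{d_{out}(k)} \\
  &= \frac{1}{W} \sum_{k:\, k \to j} \frac{w(k \to j)}{d_{out}(k)}
     \sum_{i:\, k \to i} w(k \to i) \\
  &= \frac{1}{W} \sum_{k:\, k \to j} \frac{w(k \to j)}{d_{out}(k)} \, d_{out}(k)
   = \frac{1}{W} \sum_{k:\, k \to j} w(k \to j)
   = \frac{d_{in}(j)}{W} = \pi_j .
\end{align*}
```

The key step is swapping the order of summation, after which the inner sum over i is exactly $d_{out}(k)$ and cancels.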
Topic Sensitive PageRank
[T. Haveliwala, 2002]
• A topic T is defined by a set of on-topic pages $S_T$.
• A T-biased PageRank is PageRank where the random jumps
(teleportations) land uniformly at random on $S_T$ rather than on any
arbitrary Web page
• Recall the alternative interpretation of PageRank, as walking
random paths of geometrically distributed lengths between
resets
– Here, a reset returns to some on-topic page
• If we assume that pages tend to link to pages with topical
affinity, short paths starting at $S_T$ will not stray too far away
from on-topic pages, hence the PageRanks will be T-biased
– Note that pages unreachable from $S_T$ will receive a T-biased PageRank of 0
• Where would be a good place to find sets $S_T$ for certain
topics?
– The pages classified under the 16 top-level topics of the Open Directory
Project (see next slide)
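A minimal sketch of a T-biased PageRank by power iteration, where every reset lands uniformly at random on the on-topic set. The damping factor of 0.85, the iteration count, the function name and the handling of dangling pages (their mass is also sent back to the on-topic set) are assumptions that the slides do not specify.

```python
import numpy as np


def biased_pagerank(out_links, topic_pages, damping=0.85, iters=100):
    """out_links[p] lists the pages that p links to; topic_pages is S_T."""
    n = len(out_links)
    jump = np.zeros(n)
    jump[list(topic_pages)] = 1.0 / len(topic_pages)
    rank = jump.copy()
    for _ in range(iters):
        new_rank = (1.0 - damping) * jump
        for page, targets in enumerate(out_links):
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a dangling page resets to the on-topic set.
                new_rank += damping * rank[page] * jump
        rank = new_rank
    return rank


# Page 2 links into the topic but cannot be reached from it.
out_links = [[1], [0], [0]]
rank = biased_pagerank(out_links, topic_pages={0})
```

Page 2 ends with a score of exactly 0, matching the note above about pages unreachable from the on-topic set.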
Topic-Sensitive PageRank (cont.)
• 16 PageRank vectors are computed, $PR_1, \ldots, PR_{16}$
• Given a query q, its affinity to the 16 topics $T_1, \ldots, T_{16}$ is
computed
– Based on the probability of generating the query by the language
model induced by each topic's set of pages $S_{T_j}$
– A distribution vector $[\alpha_1, \ldots, \alpha_{16}]$ is computed, where
$\alpha_j \propto \Pr(q \mid \text{language model of } S_{T_j})$
• The PageRank vector that will be used to serve q is
$PR_q = \sum_j \alpha_j PR_j$
• The idea of biasing PageRank’s random jump destinations is
also used for personalized PageRank flavors [e.g. Jeh and
Widom 2003]
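The query-time step can be sketched as normalizing the per-topic query likelihoods into a distribution and taking the corresponding weighted sum of the precomputed vectors. The function name and the numbers (two topics and three pages instead of 16 topics) are hypothetical.

```python
import numpy as np


def blend_topic_pageranks(topic_vectors, query_likelihoods):
    """Return the sum over j of alpha_j * PR_j, where alpha is the
    normalized vector of query likelihoods under each topic's model."""
    pr = np.asarray(topic_vectors, dtype=float)  # shape: (topics, pages)
    likelihoods = np.asarray(query_likelihoods, dtype=float)
    alpha = likelihoods / likelihoods.sum()
    return alpha @ pr


topic_vectors = [
    [0.5, 0.5, 0.0],  # PageRank biased toward topic 1
    [0.0, 0.2, 0.8],  # PageRank biased toward topic 2
]
pr_q = blend_topic_pageranks(topic_vectors, query_likelihoods=[3e-4, 1e-4])
```

Because the weights form a distribution and each topic vector sums to 1, the blended vector also sums to 1.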
Link Analysis Algorithms - Summary
• Many variants and refinements of both HITS and PageRank
have been proposed.
• Other approaches include:
– Max-flow techniques [Flake et al., SIGKDD 2000]
– Machine learning and Bayesian techniques
• Examples of applications:
– Ranking pages (topic specific/global importance/ personalized rankings)
– Categorization, clustering, finding related pages
– Identifying virtual communities
• Computational issues:
– Distributed computations of eigenvectors of massive, sparse matrices
– Convergence acceleration, approximations
• A wealth of literature