Page Rank, Structure of Web and Analyzing A Web Graph

1. PageRank is an algorithm used by Google to determine the importance of web pages based on their inbound links.
2. The web can be represented as a graph with pages as vertices and hyperlinks as edges. PageRank assigns each page a numerical ranking based on the pages that link to it and the PageRanks of those pages.
3. Topic-sensitive PageRank computes different PageRanks for individual topics by introducing a bias or preference toward pages relevant to that topic. This helps combat link spam by reducing the impact of spammy, less trustworthy pages.

Uploaded by

First Last
Copyright
© All Rights Reserved

Page Rank, Structure of Web and Analyzing a Web Graph

Page Rank

PageRank is one of the algorithms used by Google to determine the relative importance of a page within
the World Wide Web.
Web Structure
● The web is a network of web pages and other resources.

● The network can be treated as a graph, that is, a set of vertices and edges.

● Let U and V be two web pages. They act as vertices.

● Page U contains a set of hyperlinks, called anchors, denoted by the vector u1, u2, u3, etc.

● Similarly, page V contains a set of anchors, denoted by the vector v1, v2, v3, etc.

● N and M denote the number of hyperlinks on page U and page V, respectively.

● Edges going from one page to another are called out-edges.
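A minimal sketch of this structure in Python, assuming an adjacency mapping from each page to the targets of its anchors. The third page W and the helper name out_edges are hypothetical, added only to make the example concrete.

```python
# Hypothetical web: each page maps to the targets of its anchors, in order.
web = {
    "U": ["V", "V", "W"],  # anchors u1, u2, u3
    "V": ["U"],            # anchor v1
    "W": [],               # a page with no anchors
}

def out_edges(graph, page):
    """Return the distinct pages that `page` links to (its out-edges)."""
    return sorted(set(graph[page]))

n = len(web["U"])  # N: number of hyperlinks on page U
m = len(web["V"])  # M: number of hyperlinks on page V
print(n, m, out_edges(web, "U"))
```

Two anchors on U point to the same page V, so U has three hyperlinks but only two distinct out-edges.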
Hypotheses

1. The text of a hyperlink (the anchor text) is a property of vertex U that describes the destination V of the outgoing edge.

2. A hyperlink between two pages represents the conferring of authority from the source page to the destination page.
Analyzing and Implementing a System with Web Graph Mining

A number of metrics are used to analyze a system with web graph mining. Examples include:

● In-degrees and out-degrees.
● Closeness, a centrality metric: $C_c(v) = 1 / \sum_{u} \mathrm{gdist}(v, u)$, where gdist(v, u) is the geodesic distance between vertices v and u, and the sum is over all u linked with v. The geodesic distance is the number of edges in a shortest path connecting two vertices. For example, if v has an edge to w, w has an edge to u, and there is no direct edge from v to u, then gdist(v, u) = 2 (two edges on the shortest path between v and u).
● Betweenness.
● PageRank and LineRank.
● Hubs and authorities.
● Community parameters, triangle count, clustering coefficient, K-neighbourhood.
● Top K-shortest paths.
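The first two metrics can be sketched in Python as follows. This is an illustration under assumptions, not a reference implementation: the graph is the three-vertex example from the closeness bullet (v to w, w to u), "linked with v" is read as "reachable from v along directed edges", and all function names are hypothetical.

```python
from collections import deque

# Hypothetical directed graph from the text's example: v -> w -> u.
graph = {"v": ["w"], "w": ["u"], "u": []}

def degrees(graph):
    """Return (in_degree, out_degree) dictionaries for a directed graph."""
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    return in_deg, out_deg

def geodesic_distances(graph, source):
    """Breadth-first search: edges on a shortest path from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def closeness(graph, v):
    """Cc(v) = 1 / sum of gdist(v, u) over all u reachable from v (0.0 if none)."""
    dist = geodesic_distances(graph, v)
    total = sum(d for node, d in dist.items() if node != v)
    return 1.0 / total if total else 0.0

in_deg, out_deg = degrees(graph)
print(in_deg, out_deg)
print(geodesic_distances(graph, "v")["u"])  # geodesic distance from v to u
print(closeness(graph, "v"))                # 1 / (1 + 2)
```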
Computation of PageRank and PageRank Iteration

1) PageRank algorithm using the in-degrees as conferring authority


Assume that page U (Pgu), when out-linking to page V (Pgv), confers an equal fraction of its authority on each of the pages it points to. Pgv therefore receives PR(Pgu)/N(Pgu) from Pgu, where N(Pgu) is the total number of out-links from U.

The normalization constant, denoted nc, is chosen so that the PR values of all pages sum to 1.

However, just measuring the in-degree does not account for the authority of the source of a link.

When Pgv gains an in-link from a page Pgu, the rank of Pgv increases. When Pgu out-links to additional pages, N(Pgu) increases and the rank PR(Pgv) sinks (decreases). Over successive iterations, PR(Pgv) converges to a value.
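A minimal Python sketch of this iteration, assuming every page has at least one out-link. The function name simple_pagerank, the fixed iteration count, and the three-page web are hypothetical choices made for illustration.

```python
def simple_pagerank(out_links, iterations=50):
    """Iterate PR(v) = nc * sum(PR(u) / N(u)) over the pages u that link to v."""
    pages = list(out_links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for u, targets in out_links.items():
            for v in targets:
                new[v] += pr[u] / len(targets)  # equal share of u's authority
        total = sum(new.values())
        pr = {p: new[p] / total for p in pages}  # nc = 1 / total, so ranks sum to 1
    return pr

# Hypothetical three-page web in which every page has at least one out-link.
before = simple_pagerank({"U": ["V"], "V": ["W"], "W": ["U", "V"]})
# U adds a second out-link: N(U) grows from 1 to 2, so V's share from U halves.
after = simple_pagerank({"U": ["V", "W"], "V": ["W"], "W": ["U", "V"]})
print(round(before["V"], 3), round(after["V"], 3))
```

In this example the rank of V sinks from about 0.4 to about 0.333 once U spreads its authority over two out-links instead of one.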
2) PageRank algorithm using the relative authority of the parents over linked children

This method of PageRank considers the entire web rather than the local neighbourhood of a page, and takes into account the relative authority of the parents over their linked children.

The PageRank of a page is proportional to the weight of each parent and inversely proportional to the number of out-links of that parent.

Assume that

(i) Page v (Pgv) has in-links from parent page u (Pgu) and from the other pages in the set PA(v) of parent pages of v, that is, u ∈ PA(v),

(ii) R(v) is the PageRank of Pgv,

(iii) R(u) is the weight (importance/rank) of Pgu, and

(iv) ch(u) is the weight of the children (out-links) of Pgu.

Then the following


An alternative equation is as follows:

where nc = [1/R(v)]. R(v) is computed iteratively over the parents in the set PA(v) until its new value changes by less than a defined margin, say 0.001, between successive iterations.
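The two relations can be sketched in LaTeX as follows. This is a reconstruction from definitions (i) to (iv) and the standard PageRank formulation, not a copy of the original equations. It assumes that ch(u) counts the out-links of Pgu, that nc normalizes the ranks so they sum to 1, and that E(v) is the random-jump term of the random surfer model.

```latex
\begin{align}
R(v) &= nc \sum_{u \in PA(v)} \frac{R(u)}{ch(u)} \\
R(v) &= nc \left[ \sum_{u \in PA(v)} \frac{R(u)}{ch(u)} + E(v) \right]
\end{align}
```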

Significance: PageRank can be seen as modeling a “random surfer” who starts on a random page and then, at each step, either follows a link from the current page or jumps to a random page. E(v) models the probability that the surfer jumps to Pgv at random, without following a link to it.

R(v) models the probability that the random surfer is at Pgv at any given time. The addition of E(v) solves the problem of the surfer reaching, by chance, a dead end (a page with no outgoing links).
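A hedged Python sketch of the iteration with the E(v) term, stopping once no rank changes by more than the 0.001 margin. The update rule R(v) = nc * (sum of R(u)/ch(u) over parents + E(v)) is an assumed form based on the standard formulation. The uniform default for E(v), the 0.15 constant, the function name, and the example web are also assumptions.

```python
def pagerank(out_links, e=None, margin=0.001, max_iter=100):
    """Iterate R(v) = nc * (sum(R(u) / ch(u) for parents u of v) + E(v))."""
    pages = list(out_links)
    if e is None:
        e = {p: 0.15 / len(pages) for p in pages}  # assumed uniform E(v)
    r = {p: 1.0 / len(pages) for p in pages}
    for _ in range(max_iter):
        new = {p: e[p] for p in pages}  # random-jump contribution
        for u, children in out_links.items():
            for v in children:
                new[v] += r[u] / len(children)  # parent's rank split over its out-links
        total = sum(new.values())
        new = {p: new[p] / total for p in pages}  # nc keeps the ranks summing to 1
        change = max(abs(new[p] - r[p]) for p in pages)
        r = new
        if change < margin:  # stop once R(v) is stable within the margin
            break
    return r

# Hypothetical web with a dead end: page D has no out-links.
ranks = pagerank({"A": ["B", "D"], "B": ["A"], "D": []})
print({p: round(x, 3) for p, x in ranks.items()})
```

Because E(v) is positive for every page, each page keeps a nonzero rank even though the rank flowing into the dead end D is never passed on.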
Web Communities
Topic-sensitive Page Rank and Link spam

1. Topic-Sensitive Page Ranking:


● Imagine you have a giant internet book with many pages. Each page talks
about different things like animals, colors, or numbers. Now, some pages
might be more important for specific topics than others. For example, a page
about animals might be more important for the topic of animals than a page
about colors.
2. Biasing and Probability:

● There's a concept called biasing, which is like giving extra importance to
certain pages. This bias is introduced using a factor called "a." This factor is
multiplied by the actual importance of a page, making it more or less
important based on the bias.

3. Random Jumps and Personalization:

● Sometimes, when you're reading the internet book, you might randomly jump
to a different page. The probability of this happening is represented by a letter
"E." Now, when it comes to topics, the importance of a page can be different.
Higher values of "a" mean higher importance for a topic.
4. Computing Page Ranks for Different Topics:

● Each page in the internet book might be important for different topics. So, we
compute the importance of a page for each topic separately. If the book has
topics like animals (t1), colors (t2), and so on, we calculate how important
each page is for each of these topics using a non-uniform personalization
vector. This vector helps adjust the importance of pages based on the topic.
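The four points above can be sketched in Python. This is one common way to formulate topic-sensitive PageRank, shown under assumptions: the bias factor "a" is taken to be the weight of the random jump toward the topic's pages, its value 0.25 is an assumed default, and the four-page web and function name are hypothetical.

```python
def topic_pagerank(out_links, topic_pages, a=0.25, iterations=100):
    """PageRank whose random jumps land only on the pages of one topic."""
    pages = list(out_links)
    # Non-uniform personalization vector: all jump probability on the topic's pages.
    pv = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages}
    r = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: a * pv[p] for p in pages}  # biased random jump
        for u, targets in out_links.items():
            if targets:
                for v in targets:
                    new[v] += (1.0 - a) * r[u] / len(targets)
            else:
                for p in pages:  # dead end: hand the rank back via the topic vector
                    new[p] += (1.0 - a) * r[u] * pv[p]
        r = new
    return r

# Hypothetical internet book with two animal pages and two color pages.
web = {"cats": ["dogs", "red"], "dogs": ["cats"], "red": ["blue", "cats"], "blue": ["red"]}
animals = topic_pagerank(web, {"cats", "dogs"})  # topic t1
colors = topic_pagerank(web, {"red", "blue"})    # topic t2
print(round(animals["cats"], 3), round(colors["cats"], 3))
```

The same page gets a different rank per topic: "cats" scores higher under the animals vector than under the colors vector, and raising "a" widens that gap.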
1. Link Spam and Topic-Sensitive PageRank:

● Imagine the internet as a big community of pages, like a city. Some tricky
websites try to cheat a system called PageRank, which decides how important
a page is. This cheating is called link spam. It's like someone trying to make
their page seem super important by pointing a lot of links at it.

2. How Link Spam Works:

● The spammy website (let's call it "ws") wants to boost its importance, so it
connects to another page ("ls") many times. Additionally, ls has its own helper
pages ("als") that only link to ls. This makes ls look more important than it
really is. The whole group of ws, ls, and als is called a spam mass.
3. Nullifying Link Spam Effect:

● Smart people have figured out a way to stop this cheating. They use something
called the topic-sensitive PageRank algorithm. It's like a superhero for the
internet. This algorithm introduces a trust rank (like a measure of how
trustworthy a page is) in its calculations. It tracks the spammy group and stops
them from tricking the system.

4. Finding Link Spam:

● To catch these tricky pages, we look for patterns. If a page has way too many
connections compared to others on the same topic, it's suspicious. We draw a
special graph called a power-law plot to see these patterns. If the graph doesn't
look normal and has a strange pattern, it's like a signal that link spam is
happening.
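A rough Python sketch of this pattern check, assuming a simple rule that flags pages whose in-degree is far above the median for the topic. The factor of 5, the function names, and the example web are hypothetical, and the degree distribution is returned as plain counts rather than drawn as a plot.

```python
from collections import Counter
from statistics import median

def in_degrees(out_links):
    """Count the in-links of every page."""
    deg = {p: 0 for p in out_links}
    for targets in out_links.values():
        for target in targets:
            deg[target] = deg.get(target, 0) + 1
    return deg

def degree_distribution(deg):
    """Map each in-degree to the number of pages having it (the data behind a power-law plot)."""
    return dict(sorted(Counter(deg.values()).items()))

def suspicious_pages(deg, factor=5):
    """Flag pages whose in-degree exceeds `factor` times the median (an assumed rule)."""
    typical = max(median(deg.values()), 1)
    return sorted(p for p, d in deg.items() if d > factor * typical)

# Hypothetical topic: helper pages h0..h7 link only to "ls"; the other pages link normally.
web = {f"h{i}": ["ls"] for i in range(8)}
web.update({"a": ["b"], "b": ["a", "c"], "c": ["a"], "ls": ["a"]})
deg = in_degrees(web)
print(degree_distribution(deg))
print(suspicious_pages(deg))
```

Here only "ls" is flagged, because its eight in-links come from pages that nothing else links to. A real detector would combine such a check with a trust measure rather than rely on in-degree alone.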
