Page Rank, Structure of Web and Analyzing A Web Graph
PageRank is one of the algorithms used by Google to determine the relative importance of a page
within the World Wide Web.
Web Structure
● The web is a network consisting of various web pages and resources
● A page U contains a set of hyperlinks, called anchors, which are denoted by
the vector u1, u2, u3, etc.
● Edges (hyperlinks) going from one page to another are called out-edges
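The structure above can be sketched as a small directed graph. This is a minimal illustration, not part of the original notes: the four-page `web` dictionary and the helper names `out_degree` and `in_links` are hypothetical.

```python
# Hypothetical four-page web: each key is a page, each value lists the
# pages it links to (its out-edges).
web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def out_degree(graph):
    """Number of out-edges of every page."""
    return {page: len(targets) for page, targets in graph.items()}

def in_links(graph):
    """For every page, the list of parent pages that link to it."""
    parents = {page: [] for page in graph}
    for page, targets in graph.items():
        for target in targets:
            parents.setdefault(target, []).append(page)
    return parents

print(out_degree(web))
print(in_links(web))
```

The length of each list returned by `in_links` is the in-degree of that page.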
Hypotheses
A number of metrics are used to analyze a system through web graph mining. The following are
examples:
The normalization constant is denoted by nc and is chosen so that the PR values of all pages sum to 1.
However, just measuring the in-degree does not account for the authority of the source of a link.
When a page Pgu links to a page Pgv (an in-link of Pgv), the rank of Pgv increases. When page Pgu
adds out-links to other pages, N(Pgu), the number of out-links of Pgu, increases, and the rank
PR(Pgv) sinks (decreases). Eventually, PR(Pgv) converges to a value.
2) PageRank algorithm using the relative authority of the parents over linked children
This PageRank method considers the entire web rather than the local neighbourhood of a page, and
takes into account the relative authority of the parents that link to each child page.
The rank of a page is proportional to the weight of each parent and inversely proportional to the
number of out-links of that parent.
Assume that
(i) Page v (Pgv) has in-links from its parent Page u (Pgu) and from the other pages in the set PA(v)
of parent pages of v, that is, u ∈ PA(v)
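In these symbols, with N(u) the number of out-links of Pgu and E(v) the random-jump term, the rank of Pgv can be sketched in the standard rank-source form of PageRank. Writing it in this section's notation is an assumption about the form intended here:

```latex
R(v) = nc \left( \sum_{u \in PA(v)} \frac{R(u)}{N(u)} + E(v) \right)
```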
where nc = [1/R(v)]. R(v) is computed iteratively over the parents in the set PA(v) until its new
value no longer changes by more than a defined margin, say 0.001, between successive iterations.
Significance: PageRank can be seen as modeling a “random surfer” who starts on a random page and
then, at each point, either follows a link or jumps to another page. E(v) models the probability that
the random surfer jumps (surfs) to Pgv without following a link.
R(v) models the probability that the random surfer is at Pgv at any given time. The addition of E(v)
solves the problem of the surfer, by chance, following a link to a dead end (a page with no outgoing
links).
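The iteration and the random-surfer interpretation can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `pagerank`, the `jump` probability of 0.15, the uniform choice of E(v), and the example graph are all illustrative. The 0.001 margin is the one mentioned in the text.

```python
def pagerank(graph, jump=0.15, tol=0.001, max_iter=100):
    """Iterate R(v) until no value changes by more than tol."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}   # start from equal ranks
    e = {p: 1.0 / n for p in pages}      # uniform E(v): an assumption
    for _ in range(max_iter):
        # Every page first receives its share of random jumps.
        new = {p: jump * e[p] for p in pages}
        for u in pages:
            targets = graph[u]
            if targets:
                # A parent splits its rank equally among its out-links.
                share = (1 - jump) * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Dead end: the surfer can only leave by a random jump.
                for v in pages:
                    new[v] += (1 - jump) * rank[u] * e[v]
        # Normalize so that all ranks sum to 1 (the role of nc).
        total = sum(new.values())
        new = {p: r / total for p, r in new.items()}
        delta = max(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank

# Hypothetical graph; page D is a dead end with no in-links.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(web)
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

Page D is reached only through random jumps, so it ends up with the lowest rank, while C, which has two parents, outranks B, which has one.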
Web Communities
Topic-sensitive Page Rank and Link spam
● Sometimes, when you're reading the internet like a book, you might randomly
jump to a different page. The probability of this happening is represented by
the letter "E". When it comes to topics, the importance of a page can differ
from one topic to another. Higher values of "a" mean higher importance for a topic.
4. Computing Page Ranks for Different Topics:
● Each page in the internet book might be important for different topics. So, we
compute the importance of a page for each topic separately. If the book has
topics like animals (t1), colors (t2), and so on, we calculate how important
each page is for each of these topics using a non-uniform personalization
vector. This vector helps adjust the importance of pages based on the topic.
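Computing a separate rank for each topic can be sketched as follows. This is a minimal sketch under assumptions: the function names `topic_vector` and `topic_pagerank`, the jump probability of 0.15, the five-page graph, and the choice of a vector that is uniform over a topic's pages and zero elsewhere are all illustrative.

```python
def topic_vector(pages, topic_pages):
    """Personalization vector: uniform over the topic's pages, zero elsewhere."""
    weight = 1.0 / len(topic_pages)
    return {p: (weight if p in topic_pages else 0.0) for p in pages}

def topic_pagerank(graph, e, jump=0.15, tol=0.001, max_iter=200):
    """PageRank in which random jumps land only where e is non-zero."""
    pages = list(graph)
    rank = dict(e)  # start from the personalization vector
    for _ in range(max_iter):
        new = {p: jump * e[p] for p in pages}
        for u in pages:
            targets = graph[u]
            if targets:
                for v in targets:
                    new[v] += (1 - jump) * rank[u] / len(targets)
            else:
                # A page without out-links hands its rank back via e.
                for v in pages:
                    new[v] += (1 - jump) * rank[u] * e[v]
        delta = max(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank

# Hypothetical graph with an animals topic (t1) and a colors topic (t2).
web = {
    "home": ["cats", "red"],
    "cats": ["dogs"],
    "dogs": ["cats", "home"],
    "red": ["blue"],
    "blue": ["red", "home"],
}
t1 = topic_vector(web, {"cats", "dogs"})
t2 = topic_vector(web, {"red", "blue"})
ranks_t1 = topic_pagerank(web, t1)
ranks_t2 = topic_pagerank(web, t2)
print(round(ranks_t1["cats"], 3), round(ranks_t1["red"], 3))
print(round(ranks_t2["cats"], 3), round(ranks_t2["red"], 3))
```

The same page receives a different rank under each vector: "cats" ranks above "red" for t1 and below it for t2.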
1. Link Spam and Topic-Sensitive PageRank:
● Imagine the internet as a big community of pages, like a city. Some tricky
websites try to cheat a system called PageRank, which decides how important
a page is. This cheating is called link spam. It's like someone trying to make
their page seem super important by connecting to it a lot.
● The spammy website (let's call it "ws") wants to boost its importance, so it
links to another page ("ls") many times. Additionally, ls has its own helper
pages ("als") that only link to ls. This makes ls look more important than it
really is. The whole group of ws, ls, and als is called a spam farm.
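The effect of the helper pages can be illustrated with a small experiment. This is a sketch under assumptions: the function name `rank_pages`, the jump probability of 0.15, the honest pages h1 to h3, and the choice of five helper pages are hypothetical. Only the names ws, ls, and als come from the text.

```python
def rank_pages(graph, jump=0.15, tol=1e-6, max_iter=500):
    """Plain PageRank with uniform random jumps."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {p: jump / n for p in pages}
        for u in pages:
            # A page without out-links spreads its rank over all pages.
            targets = graph[u] or pages
            for v in targets:
                new[v] += (1 - jump) * rank[u] / len(targets)
        delta = max(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank

# Hypothetical web: three honest pages plus ws and ls.
base = {
    "h1": ["h2"],
    "h2": ["h1", "h3"],
    "h3": ["h1"],
    "ws": ["ls"],
    "ls": ["ws"],
}

# The same web with five helper pages ("als") that link only to ls.
farm = dict(base)
for i in range(1, 6):
    farm["als%d" % i] = ["ls"]

before = rank_pages(base)["ls"]
after = rank_pages(farm)["ls"]
print(round(before, 3), round(after, 3))
```

With the helper pages in place, the rank of ls rises even though no honest page links to it.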
3. Nullifying Link Spam Effect:
● Smart people have figured out a way to stop this cheating. They use something
called the topic-sensitive PageRank algorithm. It's like a superhero for the
internet. This algorithm introduces a trust rank (like a measure of how
trustworthy a page is) in its calculations. It tracks the spammy group and stops
them from tricking the system.
● To catch these tricky pages, we look for patterns. If a page has way too many
connections compared to others on the same topic, it's suspicious. We draw a
special graph called a power-law plot to see these patterns. If the graph doesn't
look normal and has a strange pattern, it's like a signal that link spam is
happening.
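The pattern check described above can be sketched as follows. This is an illustration under assumptions, not the detection method used by any search engine: the function names, the rule of flagging pages whose in-degree exceeds ten times the median, and the sample in-degrees are all hypothetical.

```python
import math
from collections import Counter
from statistics import median

def degree_distribution(in_degree):
    """Return (log10 degree, log10 page count) points for a power-law plot."""
    counts = Counter(d for d in in_degree.values() if d > 0)
    return sorted((math.log10(d), math.log10(c)) for d, c in counts.items())

def suspicious_pages(in_degree, factor=10):
    """Flag pages whose in-degree exceeds factor times the median (hypothetical rule)."""
    typical = max(median(in_degree.values()), 1)
    return sorted(p for p, d in in_degree.items() if d > factor * typical)

# Hypothetical in-degrees for pages on one topic.
in_degree = {"p1": 1, "p2": 1, "p3": 2, "p4": 1, "p5": 3, "p6": 2, "ls": 40}
points = degree_distribution(in_degree)
flagged = suspicious_pages(in_degree)
print(points)
print(flagged)
```

On a log-log plot, points from a power-law distribution fall roughly on a straight line, so an isolated point far to the right, such as the one produced by ls here, stands out.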