Structure of the Web
Problem 1: Find the limiting distribution of the above example by applying the Markovian process.
Problem 2: Find the limiting distribution of Figure 3 by applying the Markovian process.
Problem 3: Remove all possible dead-end nodes from the following graph, and then find the limiting distribution of the resulting graph.
Problem 4: How do we find the PageRank for recursive dead ends? (refer to page 173)
Problem 5: Find the PageRank for the following graph using the Markovian process.
To avoid the problem illustrated by Example 5, we modify the calculation of PageRank by allowing
each random surfer a small probability of teleporting to a random page, rather than following an out-
link from their current page.
The iteration thus becomes
v' = βMv + (1 − β)e/n
where β is a chosen constant, usually in the range 0.8 to 0.9, e is a vector of all 1s with the appropriate number of components, and n is the number of nodes in the Web graph. The term βMv represents the case where, with probability β, the random surfer decides to follow an out-link from their present page. The term (1 − β)e/n is a vector each of whose components has value (1 − β)/n; it represents the introduction, with probability 1 − β, of a new random surfer at a random page.
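For concreteness, here is a minimal Python/NumPy sketch of this taxed iteration. The function name, the tolerance, and the small 4-node matrix at the bottom are assumptions made for illustration; in particular, the matrix is not the one from any figure in these notes.

import numpy as np

def pagerank(M, beta=0.8, tol=1e-8, max_iter=1000):
    """Iterate v' = beta*M*v + (1 - beta)*e/n until v stops changing.
    M is the n x n column-stochastic transition matrix of the Web graph;
    beta is the probability of following an out-link."""
    n = M.shape[0]
    v = np.full(n, 1.0 / n)            # start with the uniform distribution
    teleport = (1.0 - beta) / n        # each component of (1 - beta)e/n
    for _ in range(max_iter):
        v_next = beta * M.dot(v) + teleport
        if np.abs(v_next - v).sum() < tol:
            break
        v = v_next
    return v_next

# Hypothetical 4-node Web (A -> B,C; B -> A; C -> A,B; D -> C):
M = np.array([[0.0, 1.0, 0.5, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 0.0]])
print(pagerank(M, beta=0.8))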
Example 5.6: Let us see how the new approach to computing PageRank fares on the graph of Fig. 5.6. We shall use β = 0.8 in this example. Thus, the equation for the iteration becomes
Notice that we have incorporated the factor β into M by multiplying each of its elements by 4/5. The components of the vector (1 − β)e/n are each 1/20, since 1 − β = 1/5 and n = 4. Here are the first few iterations:
Problem: Compute the PageRank of each page in Figure 7, assuming no taxation.
Problem: Compute the PageRank of each page in Fig. 5.7, assuming β = 0.8.
Problem: Suppose the Web consists of a clique (set of nodes with all possible arcs from one to
another) of n nodes and a single additional node that is the successor of each of the n nodes in the
clique. Figure 5.8 shows this graph for the case n = 4. Determine the PageRank of each page, as a
function of n and β.
Problem:
Suppose we recursively eliminate dead ends from the graph, solve the remaining graph, and
estimate the PageRank for the dead-end pages as described above. Suppose the graph is a
chain of dead ends, headed by a node with a self-loop, as suggested in Fig. 9. What would be the PageRank assigned to each of the nodes?
The mathematical formulation for the iteration that yields topic-sensitive PageRank is
similar to the equation we used for general PageRank. The only difference is how we add the
new surfers. Suppose S is a set of integers consisting of the row/column numbers for the pages we have identified as belonging to a certain topic (called the teleport set). Let eS be a vector that has 1 in the components in S and 0 in the other components. Then the topic-sensitive PageRank for S is the limit of the iteration
v' = βMv + (1 − β)eS/|S|
Here, as usual, M is the transition matrix of the Web, and |S| is the size of set S.
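A hedged Python/NumPy sketch of this iteration follows; the function name and calling convention are assumptions for illustration, with pages indexed 0, 1, 2, ... in the order they appear in M.

import numpy as np

def topic_sensitive_pagerank(M, S, beta=0.8, tol=1e-8, max_iter=1000):
    """Iterate v' = beta*M*v + (1 - beta)*e_S/|S| to a fixed point.
    M is the column-stochastic transition matrix; S lists the indices of the teleport set."""
    n = M.shape[0]
    S = list(S)
    e_S = np.zeros(n)
    e_S[S] = 1.0                       # 1 in the components of S, 0 elsewhere
    teleport = (1.0 - beta) * e_S / len(S)
    v = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        v_next = beta * M.dot(v) + teleport
        if np.abs(v_next - v).sum() < tol:
            break
        v = v_next
    return v_next

# With pages ordered A, B, C, D, the teleport set S = {B, D} is passed as [1, 3].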
Example: Let us reconsider the original Web graph we used in Figure 1, which we reproduce as Figure 15. Suppose we use β = 0.8. Then the transition matrix for this graph, multiplied by β, is
Suppose that our topic is represented by the teleport set S = {B, D}. Then the vector (1 − β)eS/|S| has 1/10 for its second and fourth components and 0 for the other two components. The reason is that 1 − β = 1/5, the size of S is 2, and eS has 1 in the components for B and D and 0 in the components for A and C. Thus, the equation that must be iterated is
Notice that because of the concentration of surfers at B and D, these nodes get a higher PageRank
than they did in Example 2.
A collection of pages whose purpose is to increase the PageRank of a certain page or pages is
called a spam farm. Figure 5.16 shows the simplest form of spam farm. From the point of
view of the spammer, the Web is divided into three parts:
1. Inaccessible pages: the pages that the spammer cannot affect. Most of the Web is in this
part.
2. Accessible pages: those pages that, while they are not controlled by the spammer, can be
affected by the spammer.
3. Own pages: the pages that the spammer owns and controls.
Figure: The Web from the point of view of the link spammer
2. Spam mass, a calculation that identifies the pages that are likely to be spam and allows the search engine to eliminate those pages or to lower their PageRank strongly. Suppose page p has PageRank r and TrustRank t. Then the spam mass of p is (r − t)/r.
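The spam-mass formula translates directly into code; the sketch below is only an illustration and assumes the PageRank and TrustRank values are already available as parallel arrays.

import numpy as np

def spam_mass(pagerank, trustrank):
    """Spam mass of each page p: (r - t)/r, where r is its PageRank and t its TrustRank."""
    r = np.asarray(pagerank, dtype=float)
    t = np.asarray(trustrank, dtype=float)
    return (r - t) / r

# Pages whose spam mass is close to 1 get most of their PageRank from untrusted
# sources and are therefore likely to be spam; small or negative values suggest
# the page's rank is supported by trusted pages.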
Example 5.12: Let us consider both the PageRank and topic-sensitive PageRank that were
computed for the graph of Fig. 5.1 in Examples 5.2 and 5.10, respectively. In the latter case,
the teleport set was nodes B and D, so let us assume those are the trusted pages. Figure 5.17
tabulates the PageRank, TrustRank, and spam mass for each of the four nodes.
Example: A typical department at a university maintains a Web page listing all the courses offered by the department, with links to a page for each course, telling about the course: the instructor, the text, an outline of the course content, and so on. If you want to know about a certain course, you need the page for that course; the departmental course list will not do. The course page is an authority for that course. However, if you want to find out what courses the department is offering, it is not helpful to search for each course's page; you need the page with the course list first. This page is a hub for information about courses.
HITS uses a mutually recursive definition of two concepts: a page is a good hub if it links
to good authorities, and a page is a good authority if it is linked to by good hubs.
To formalize the above intuition, we shall assign two scores to each Web page. One score
represents the hubbiness of a page, that is, the degree to which it is a good hub, and the
second score represents the degree to which the page is a good authority. Assuming that
pages are enumerated, we represent these scores by vectors h and a. The ith component of h
gives the hubbiness of the ith page, and the ith component of a gives the authority of the same
page.
We normally scale the values of the vectors h and a so that the largest component is 1.
An alternative is to scale so that the sum of components is 1.
To describe the iterative computation of h and a formally, we use the link matrix of the Web, L. If we have n pages, then L is an n × n matrix, and Lij = 1 if there is a link from page i to page j, and Lij = 0 if not. We shall also have need for LT, the transpose of L. That is, LTij = 1 if there is a link from page j to page i, and LTij = 0 otherwise. Notice that LT is similar to the matrix M that we used for PageRank, but where LT has 1, M has a fraction: 1 divided by the number of out-links from the page represented by that column.
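As an illustration, here is a small Python sketch of how L and LT might be built from an edge list; the three-page example is made up and is not one of the figures in these notes.

import numpy as np

def link_matrix(n, edges):
    """Build the n x n link matrix L, where L[i, j] = 1 iff there is a link from page i to page j."""
    L = np.zeros((n, n), dtype=int)
    for i, j in edges:
        L[i, j] = 1
    return L

# Hypothetical 3-page Web: page 0 links to pages 1 and 2, and page 1 links to page 2.
L = link_matrix(3, [(0, 1), (0, 2), (1, 2)])
LT = L.T     # LT[i, j] = 1 iff there is a link from page j to page i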
Example: For a running example, we shall use the Web of Figure -4, which we reproduce
here as Fig. 5.18. An important observation is that dead ends or spider traps do not prevent
the HITS iteration from converging to a meaningful pair of vectors. Thus, we can work with
the following Figure directly, with no taxation or alteration of the graph needed. The link
matrix L and its transpose are shown in the following Figure.
Figure: Sample data used for HITS example
Figure: The link matrix for the Web of the above figure and its transpose
The fact that the hubbiness of a page is proportional to the sum of the authorities of its successors is expressed by the equation h = λLa, where λ is an unknown constant representing the scaling factor needed. Likewise, the fact that the authority of a page is proportional to the sum of the hubbinesses of its predecessors is expressed by a = μLTh, where μ is another scaling constant. These equations allow us to compute the hubbiness and authority independently, by substituting one equation in the other, as:
h = λμLLTh
a = λμLTLa
However, since LLT and LTL are not as sparse as L and LT, we are usually better off
computing h and a in a true mutual recursion. That is, start with h a vector of all 1s.
1. Compute a = LTh and then scale so the largest component is 1.
2. Next, compute h = La and scale again.
Now, we have a new h and can repeat steps (1) and (2) until at some iteration the changes to
the two vectors are sufficiently small that we can stop and accept the current values as the
limit.
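The mutual recursion just described can be written as a short Python sketch; the function name and stopping rule below are assumptions chosen for illustration.

import numpy as np

def hits(L, tol=1e-8, max_iter=1000):
    """Alternate a = L^T h and h = L a, scaling each vector so its largest component is 1."""
    n = L.shape[0]
    h = np.ones(n)                     # start with h = all 1s
    for _ in range(max_iter):
        a = L.T.dot(h)
        a /= a.max()                   # scale so the largest component of a is 1
        h_next = L.dot(a)
        h_next /= h_next.max()         # scale h the same way
        if np.abs(h_next - h).max() < tol:
            return h_next, a
        h = h_next
    return h, a

Applied to the link matrix L of the figure above, this iteration should settle to the limiting hubbiness and authority vectors discussed in the example that follows.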
Figure: First two iterations of the HITS algorithm
Example: Let us perform the first two iterations of the HITS algorithm on the Web of Fig.
5.18. In Fig. 5.20 we see the succession of vectors computed. The first column is the initial h,
all 1s. In the second column, we have estimated the relative authority of pages by computing
LTh, thus giving each page the sum of the hubbinesses of its predecessors. The third column
gives us the first estimate of a. It is computed by scaling the second column; in this case we
have divided each component by 2, since that is the largest value in the second column.
The fourth column is La. That is, we have estimated the hubbiness of each page by summing
the estimate of the authorities of each of its successors. Then, the fifth column scales the
fourth column. In this case, we divide by 3, since that is the largest value in the fourth
column. Columns six through nine repeat the process outlined in our explanations for
columns two through five, but with the better estimate of hubbiness given by the fifth column.
The limit of this process may not be obvious, but it can be computed by a simple program.
The limits are:
This result makes sense. First, we notice that the hubbiness of E is surely 0, since it leads
nowhere. The hubbiness of C depends only on the authority of E and vice versa, so it should
not surprise us that both are 0. A is the greatest hub, since it links to the three biggest
authorities, B, C, and D. Also, B and C are the greatest authorities, since they are linked to by
the two biggest hubs, A and D.
For Web-sized graphs, the only way of computing the solution to the hubs-and-authorities equations is iteratively. However, for this tiny example, we can compute the solution by solving equations. We shall use the equation h = λμLLTh. First, LLT is
Let ν = 1/(λμ) and let the components of h for nodes A through E be a through e, respectively. Then the equations for h can be written
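Numerically, h is (up to scaling) the principal eigenvector of LLT, and a that of LTL; the following NumPy sketch, offered only as an illustration rather than the text's own derivation, can be used to check such a hand computation.

import numpy as np

def principal_eigenvector(A):
    """Eigenvector of A for the largest eigenvalue, scaled so its largest component is 1."""
    eigenvalues, eigenvectors = np.linalg.eig(A)
    v = eigenvectors[:, np.argmax(eigenvalues.real)].real
    return v / v[np.argmax(np.abs(v))]   # fix the sign and scale the top component to 1

# h is proportional to the principal eigenvector of L L^T, and a to that of L^T L:
#   h = principal_eigenvector(L.dot(L.T))
#   a = principal_eigenvector(L.T.dot(L))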
Problem: Compute the hubbiness and authority of each of the nodes in our original Web graph of Fig. 1.
Problem: Suppose our graph is a chain of n nodes, as was suggested by Fig. 9. Compute the hubs and authorities vectors, as a function of n.