IR Unit II
-Edith Juni
Link Analysis
• WWW
• Other Applications
• Web page references
WWW
HITS
HITS Algorithm
• Jon Kleinberg's HITS algorithm identifies good authorities and hubs for a
topic by assigning two numbers to each page: an authority weight and a hub weight.
These weights are defined recursively. A page gets a higher authority weight if it is
pointed to by pages with high hub weights. A page gets a higher hub weight if it
points to many pages with high authority weights.
• A good hub increases the authority weight of the pages it points to. A good authority
increases the hub weight of the pages that point to it. The idea is then to apply the
two operations above alternately until equilibrium values for the hub and
authority weights are reached.
In simple words
• Step 1: Find the adjacency matrix A of the
given graph.
• Step 2: Find the transpose A^T of the adjacency
matrix.
• Step 3: Assume an initial hub weight vector u = 1
(every entry equal to 1).
• Step 4: Compute the authority weight vector
v = A^T.u
• Step 5: Then find the updated hub weight vector
u = A.v
• Step 6: The final result comes from comparing each node's hub
weight with its authority weight.
For example, node 1 is a hub since its hub weight 2 > its authority weight 0, and
node 3 is an authority since its hub weight 0 < its authority weight 2.
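The six steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the graph used here (node 1 and node 2 each link to node 3) is hypothetical, since the slide's own example graph is not reproduced in the text, and the function name `hits` and its `normalize` option are not from the original. With one un-normalized pass it happens to give hub weight 2 and authority weight 0 for node 1, and hub weight 0 and authority weight 2 for node 3.

```python
def hits(adjacency, iterations=1, normalize=False):
    """Return (hub, authority) weight lists for a 0/1 adjacency matrix."""
    n = len(adjacency)
    hub = [1.0] * n            # Step 3: initial hub vector u = (1, ..., 1)
    auth = [0.0] * n
    for _ in range(iterations):
        # Step 4: v = A^T.u  (sum the hub weights of the pages pointing to j)
        auth = [sum(adjacency[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # Step 5: u = A.v    (sum the authority weights of the pages i points to)
        hub = [sum(adjacency[i][j] * auth[j] for j in range(n)) for i in range(n)]
        if normalize:
            # Optional scaling so repeated passes do not grow without bound.
            auth_norm = sum(a * a for a in auth) ** 0.5 or 1.0
            hub_norm = sum(h * h for h in hub) ** 0.5 or 1.0
            auth = [a / auth_norm for a in auth]
            hub = [h / hub_norm for h in hub]
    return hub, auth


# Hypothetical graph (an assumption): edges 1 -> 3 and 2 -> 3.
# Step 1: adjacency matrix, A[i][j] = 1 if node i+1 links to node j+1.
A = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
]

hub, auth = hits(A)

# Step 6: compare the two weights of every node.
roles = []
for h, a in zip(hub, auth):
    roles.append("hub" if h > a else "authority" if a > h else "neither")

for node, (h, a, role) in enumerate(zip(hub, auth, roles), start=1):
    print(f"node {node}: hub={h}, authority={a} -> {role}")
```

Calling `hits(A, iterations=20, normalize=True)` repeats Steps 4 and 5 with scaling, which is closer to the "until equilibrium" description in the text.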
Example
PageRank
• So what is PageRank?
– PageRank is a “vote”, by all the other pages on the Web, about how
important a page is.
– A link to a page counts as a vote of support.
– If there’s no link there’s no support (but it’s an abstention from voting
rather than a vote against the page).
– The original PageRank algorithm was designed by Lawrence Page and
Sergey Brin.
• How is PageRank Used?
– PageRank is one of the methods Google uses to determine a page’s
relevance or importance.
– It is only one part of the story when it comes to the Google listing, but the
other aspects are discussed elsewhere (and are ever changing) and
PageRank is interesting enough to deserve a paper of its own.
PageRank
• We begin by picturing the Web as a directed graph, with nodes
represented by web pages and edges represented by the links
between them.
• Suppose, for instance, that we have a small Internet consisting of
just 4 web sites, www.page1.com, www.page2.com,
www.page3.com and www.page4.com, referencing each other in the
manner suggested by the picture:
• We "translate" the picture into a directed graph with 4 nodes, one for each
web site.
• When web site i references web site j, we add a directed edge from node i to
node j in the graph.
• For the purpose of computing their page rank, we ignore any navigational
links such as back, next buttons, as we only care about the connections
between different web sites.
• For instance, Page1 links to all of the other pages, so node 1 in the graph
will have outgoing edges to all of the other nodes.
• Page3 has only one link, to Page 1, therefore node 3 will have one outgoing
edge to node 1.
• After analyzing each web page, we get the following graph:
• In our model, each page should transfer its importance evenly to the
pages that it links to.
• Node 1 has 3 outgoing edges, so it will pass on 1/3 of its importance to each
of the other 3 nodes.
• Node 3 has only one outgoing edge, so it will pass on all of its importance
to node 1.
• In general, if a node has k outgoing edges, it will pass on 1/k of its importance
to each of the nodes that it links to.
• Let us better visualize the process by assigning weights to each edge.
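The edge weights and the repeated transfer of importance can be sketched as follows. Only the links of Page1 (to all other pages) and Page3 (to Page1) come from the text; the links assumed here for Page2 and Page4 are hypothetical stand-ins for the missing picture, and the names `edge_weights` and `pagerank` are illustrative. The sketch uses the plain even-transfer model described above, with no damping factor, and assumes every page has at least one outgoing link.

```python
links = {
    1: [2, 3, 4],  # from the text: Page1 links to all other pages
    2: [3, 4],     # assumption: not stated in the text
    3: [1],        # from the text: Page3 links only to Page1
    4: [1, 3],     # assumption: not stated in the text
}


def edge_weights(links):
    """A node with k outgoing edges gives weight 1/k to each of them."""
    return {src: {dst: 1.0 / len(dsts) for dst in dsts}
            for src, dsts in links.items()}


def pagerank(links, iterations=50):
    """Repeatedly let every page pass its importance along its edges."""
    weights = edge_weights(links)
    rank = {page: 1.0 / len(links) for page in links}  # start with equal importance
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in links}
        for src, outgoing in weights.items():
            for dst, weight in outgoing.items():
                new_rank[dst] += rank[src] * weight
        rank = new_rank
    return rank


weights = edge_weights(links)
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(f"Page{page}: {rank[page]:.4f}")
```

Under these assumed links, Page1 ends up with the largest share of importance, because Page3 passes all of its importance to it.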
Example
SIMILARITY
• A similarity measure is a function that computes the
degree of similarity between two vectors.
• Using a similarity measure between the query and
each document, it is possible to rank the retrieved
documents in the order of presumed relevance.
• It is possible to enforce a certain threshold so that
the size of the retrieved set can be controlled.
• Document Similarity.
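A minimal sketch of ranking by similarity with a threshold is shown below. The text does not name a specific measure, so cosine similarity over simple term-count vectors is used here as one common choice; the function names, the sample documents and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (dict-like)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents, threshold=0.0):
    """Score each document against the query, drop scores below the threshold,
    and return (score, doc_id) pairs in decreasing order of similarity."""
    query_vector = Counter(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        score = cosine_similarity(query_vector, Counter(text.lower().split()))
        if score >= threshold:
            scored.append((score, doc_id))
    return sorted(scored, reverse=True)


# Hypothetical mini collection.
documents = {
    "d1": "link analysis of web pages",
    "d2": "hadoop map reduce",
    "d3": "web link graph",
}

ranked = rank_documents("web link", documents, threshold=0.1)
for score, doc_id in ranked:
    print(f"{doc_id}: {score:.3f}")
```

Raising the threshold shrinks the retrieved set, which is the control over result size mentioned above.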
What is Hadoop
Deciding what will be the key and what will be the value is the developer's
responsibility.
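To make the key/value choice concrete, here is a small in-memory imitation of the map, shuffle and reduce phases for word counting. It does not use the Hadoop API; the function names are hypothetical, and the choice of the word as the key and the count 1 as the value is the developer's decision that the sentence above refers to.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit (key, value) pairs; here key = word, value = 1."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine the values of one key; here by summing them."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["hub and authority", "hub weight"])
print(counts)
```

Choosing a different key, for example the word length instead of the word, would change what the reducer aggregates without changing the overall structure.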
Properties of MapReduce Engine
• Job Tracker is the master node (runs with the namenode)
– Receives the user’s job
– Decides on how many tasks will run (number of mappers)
– Decides on where to run each mapper (concept of locality)