IR Unit II

Uploaded by shelakeavi2003


Unit II

-Edith Juni
Link Analysis
• WWW
• Other Applications
• Web page references
WWW
HITS
HITS Algorithm
• Jon Kleinberg's HITS algorithm identifies good authorities and hubs for a
topic by assigning two numbers to each page: an authority weight and a hub
weight. These weights are defined recursively. A page gets a higher authority
weight if it is pointed to by pages with high hub weights, and a higher hub
weight if it points to many pages with high authority weights.
• A good hub increases the authority weight of the pages it points to. A good
authority increases the hub weight of the pages that point to it. The idea is
then to apply the two operations above alternately until the hub and
authority weights reach equilibrium values.
In Simple words
• Step 1: Find the adjacency matrix A of the given graph.
• Step 2: Find the transpose A^T of the adjacency matrix.
• Step 3: Assume an initial hub weight vector u = 1.
• Step 4: Compute the authority weight vector v = A^T · u.
• Step 5: Then find the updated hub weight u = A · v.
• Step 6: The final result comes from comparing each node's hub weight
with its authority weight:
e.g. node 1 is a hub since its hub weight is larger (2 > 0),
and node 3 is an authority since its authority weight is larger (0 < 2).
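The steps above can be sketched in a few lines of Python. This is a minimal sketch, not the full algorithm as presented in the slides: the 3-node graph and its edges are hypothetical, and the weights are normalized each round so they stay bounded.

```python
# Minimal HITS sketch on a small, hypothetical directed graph.
# adj[i][j] = 1 means page i links to page j.
adj = [
    [0, 1, 1],   # node 1 links to nodes 2 and 3
    [0, 0, 1],   # node 2 links to node 3
    [0, 0, 0],   # node 3 has no outgoing links
]
n = len(adj)

hub = [1.0] * n          # Step 3: initial hub weight vector u = 1
for _ in range(20):      # repeat until (approximate) equilibrium
    # Step 4: authority v = A^T · u  (sum of hub weights of in-linking pages)
    auth = [sum(adj[i][j] * hub[i] for i in range(n)) for j in range(n)]
    # Step 5: updated hub u = A · v  (sum of authority weights of linked pages)
    hub = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
    # Normalize so the weights do not grow without bound
    auth = [a / (sum(auth) or 1) for a in auth]
    hub = [h / (sum(hub) or 1) for h in hub]

print(auth, hub)
```

At equilibrium, node 1 (which links to everything) ends up with the largest hub weight and node 3 (which everything links to) with the largest authority weight, matching Step 6.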
Example
PageRank
• So what is PageRank?
– PageRank is a “vote”, by all the other pages on the Web, about how
important a page is.
– A link to a page counts as a vote of support.
– If there’s no link there’s no support (but it’s an abstention from voting
rather than a vote against the page).
– The original PageRank algorithm was designed by Lawrence Page and
Sergey Brin.
• How is PageRank Used?
– PageRank is one of the methods Google uses to determine a page’s
relevance or importance.
– It is only one part of the story when it comes to the Google listing, but the
other aspects are discussed elsewhere (and are ever changing) and
PageRank is interesting enough to deserve a paper of its own.
PageRank
• We begin by picturing the Web as a directed graph, with nodes
representing web pages and edges representing the links
between them.
• Suppose for instance, that we have a small Internet consisting of
just 4 web sites www.page1.com, www.page2.com,
www.page3.com, www.page4.com, referencing each other in the
manner suggested by the picture:
• We "translate" the picture into a directed graph with 4 nodes, one for each
web site.
• When web site i references j, we add a directed edge between node i and
node j in the graph.
• For the purpose of computing their page rank, we ignore any navigational
links such as back, next buttons, as we only care about the connections
between different web sites.
• For instance, Page1 links to all of the other pages, so node 1 in the graph
will have outgoing edges to all of the other nodes.
• Page3 has only one link, to Page 1, therefore node 3 will have one outgoing
edge to node 1.
• After analyzing each web page, we get the following graph:
• In our model, each page should transfer its importance evenly to the
pages that it links to.
• Node 1 has 3 outgoing edges, so it will pass on 1/3 of its importance to
each of the other 3 nodes.
• Node 3 has only one outgoing edge, so it will pass on all of its importance
to node 1.
• In general, if a node has k outgoing edges, it will pass on 1/k of its
importance to each of the nodes that it links to.
• Let us better visualize the process by assigning weights to each edge.
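Under these rules, the importance vector can be computed by repeatedly passing each page's rank along its outgoing edges. A sketch in Python follows; note that the outgoing links for pages 2 and 4 are assumed, since the text only specifies the links of Page1 and Page3.

```python
# PageRank by simple power iteration on the 4-page example.
# Links for pages 2 and 4 are assumed (the slides specify only 1 and 3).
links = {
    1: [2, 3, 4],   # Page1 links to all of the other pages
    2: [3, 4],
    3: [1],         # Page3 links only to Page1
    4: [1, 3],
}
pages = sorted(links)
rank = {p: 1.0 / len(pages) for p in pages}   # start with equal importance

for _ in range(50):
    new_rank = {p: 0.0 for p in pages}
    for p, outs in links.items():
        share = rank[p] / len(outs)   # a node with k edges passes on 1/k
        for q in outs:
            new_rank[q] += share      # of its importance to each target
    rank = new_rank

print(rank)
```

This is the basic model from the slides, without the damping factor used in the full PageRank formulation; with these links, Page1 ends up with the highest rank.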
Example
SIMILARITY
• A similarity measure is a function that computes the
degree of similarity between two vectors.
• Using a similarity measure between the query and
each document, it is possible to rank the retrieved
documents in order of presumed relevance.
• It is possible to enforce a certain threshold so that
the size of the retrieved set can be controlled.
• Document Similarity.
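The slides do not fix a specific measure; a standard choice in the vector space model is cosine similarity. The sketch below (with made-up term-weight vectors) shows both uses mentioned above: ranking by similarity and cutting off at a threshold.

```python
import math

# Cosine similarity between two term-weight vectors (a common choice;
# the query and document vectors here are hypothetical).
def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return dot / norm if norm else 0.0

query = [1, 1, 0]                               # term weights for the query
docs = [[2, 1, 0], [0, 1, 3], [1, 0, 0]]        # term weights per document

# Rank documents by similarity to the query (order of presumed relevance)
scores = sorted(enumerate(cosine_similarity(query, d) for d in docs),
                key=lambda s: s[1], reverse=True)
# Enforce a threshold to control the size of the retrieved set
retrieved = [i for i, s in scores if s >= 0.5]
print(scores, retrieved)
```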
What is Hadoop
• The Hadoop framework consists of two main layers:
– Distributed file system (HDFS)
– Execution engine (MapReduce)
Hadoop Master/Slave Architecture
• Hadoop is designed as a master-slave shared-nothing architecture
Who Uses MapReduce/Hadoop
• Google: inventors of the MapReduce computing
paradigm
• Yahoo: developed Hadoop, an open-source
implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others, plus universities and research labs
Hadoop: How it Works
• Distributed file system (HDFS)
• Execution engine (MapReduce)
Hadoop Distributed File System (HDFS)
Main Properties of HDFS
• Large: An HDFS instance may consist of thousands
of server machines, each storing part of the file
system's data
• Replication: Each data block is replicated many
times (the default is 3)
• Failure: Failure is the norm rather than the exception
• Fault Tolerance: Detection of faults and quick,
automatic recovery from them is a core
architectural goal of HDFS
– The Namenode constantly checks on the Datanodes
MapReduce Phases
• Deciding on what will be the key and what will be the value is the
developer's responsibility.
Properties of MapReduce Engine
• Job Tracker is the master node (runs with the namenode)
– Receives the user’s job
– Decides on how many tasks will run (number of mappers)
– Decides on where to run each mapper (concept of locality)
• Task Tracker is the slave node (runs on each datanode)
– Receives the task from Job Tracker
– Runs the task until completion (either map or reduce task)
– Always in communication with the Job Tracker reporting progress
Key-Value Pairs
• Mappers and Reducers are users’ code (provided functions)
• Just need to obey the Key-Value pairs interface
• Mappers:
– Consume <key, value> pairs
– Produce <key, value> pairs
• Reducers:
– Consume <key, <list of values>>
– Produce <key, value>
• Shuffling and Sorting:
– Hidden phase between mappers and reducers
– Groups all similar keys from all mappers, sorts and passes them to a
certain reducer in the form of <key, <list of values>>
Example 1: Word Count
• Job: Count the occurrences of each word in a data set
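The word-count job can be simulated in plain Python to show all three phases described above: map, shuffle-and-sort, and reduce. This is a sketch of the idea, not Hadoop code, and the input lines are made up.

```python
from itertools import groupby

# Mapper: consumes a line, produces <word, 1> pairs
def mapper(line):
    for word in line.split():
        yield (word, 1)

# Reducer: consumes <word, list of counts>, produces <word, total>
def reducer(word, counts):
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase
pairs = [kv for line in lines for kv in mapper(line)]
# Shuffle & sort: group all identical keys from all mappers together
pairs.sort(key=lambda kv: kv[0])
grouped = ((k, [v for _, v in g]) for k, g in groupby(pairs, key=lambda kv: kv[0]))
# Reduce phase
counts = dict(reducer(k, vs) for k, vs in grouped)
print(counts)
```

Here "the" appears three times and "fox" twice, so the reducer emits `("the", 3)` and `("fox", 2)`; in real Hadoop the shuffle step also routes each key group to a particular reducer node.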
Example 2: Color Count
• Job: Count the number of each color in a data set
Example 3: Color Filter
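A color filter differs from the two counting jobs in that it is a map-only job: the mapper emits only the records of the wanted color, and no reduce phase is needed. A sketch with a hypothetical data set:

```python
# Map-only color filter: keep only records matching the wanted color.
records = ["red", "blue", "green", "red", "blue", "red"]

def filter_mapper(record, wanted="red"):
    # Emit the record only if it matches; otherwise emit nothing
    if record == wanted:
        yield (record, 1)

# Map phase only -- the filtered pairs are the final output
filtered = [kv for r in records for kv in filter_mapper(r)]
print(filtered)
```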
