IR Unit II

Uploaded by shelakeavi2003


Unit II

-Edith Juni
Link Analysis
• WWW
• Other Applications
• Web page references
WWW
HITS
HITS Algorithm
• Jon Kleinberg's HITS algorithm identifies good authorities and hubs for a
topic by assigning two numbers to each page: an authority weight and a hub
weight. These weights are defined recursively. A page gets a higher authority
weight if it is pointed to by pages with high hub weights, and a higher hub
weight if it points to many pages with high authority weights.
• A good hub increases the authority weight of the pages it points to. A good
authority increases the hub weight of the pages that point to it. The idea is
then to apply the two operations above alternately until the hub and
authority weights reach equilibrium values.
In Simple words
• Step 1: Find the adjacency matrix A of the given graph.
• Step 2: Find the transpose A^T of the adjacency matrix.
• Step 3: Assume an initial hub weight vector u = 1.
• Step 4: Compute the authority weight vector v = A^T · u.
• Step 5: Then find the updated hub weight u = A · v.
• Step 6: The final result comes from comparing each node's hub weight
with its authority weight:
e.g. node 1 is a hub since its hub weight is larger (2 > 0),
and node 3 is an authority since its authority weight is larger (0 < 2).
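The steps above can be sketched in a few lines of Python. This is a minimal sketch, not the full algorithm as presented in the slides: the 3-node graph and its edges are hypothetical, and the weights are normalized each round so they stay bounded.

```python
# Minimal HITS sketch on a small, hypothetical directed graph.
# adj[i][j] = 1 means page i links to page j.
adj = [
    [0, 1, 1],   # node 1 links to nodes 2 and 3
    [0, 0, 1],   # node 2 links to node 3
    [0, 0, 0],   # node 3 has no outgoing links
]
n = len(adj)

hub = [1.0] * n          # Step 3: initial hub weight vector u = 1
for _ in range(20):      # repeat until (approximate) equilibrium
    # Step 4: authority v = A^T · u  (sum of hub weights of in-linking pages)
    auth = [sum(adj[i][j] * hub[i] for i in range(n)) for j in range(n)]
    # Step 5: updated hub u = A · v  (sum of authority weights of linked pages)
    hub = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
    # Normalize so the weights do not grow without bound
    auth = [a / (sum(auth) or 1) for a in auth]
    hub = [h / (sum(hub) or 1) for h in hub]

print(auth, hub)
```

At equilibrium, node 1 (which links to everything) ends up with the largest hub weight and node 3 (which everything links to) with the largest authority weight, matching Step 6.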
Example
PageRank
• So what is PageRank?
– PageRank is a “vote”, by all the other pages on the Web, about how
important a page is.
– A link to a page counts as a vote of support.
– If there’s no link there’s no support (but it’s an abstention from voting
rather than a vote against the page).
– The original PageRank algorithm was designed by Lawrence Page and
Sergey Brin.
• How is PageRank Used?
– PageRank is one of the methods Google uses to determine a page’s
relevance or importance.
– It is only one part of the story when it comes to the Google listing, but the
other aspects are discussed elsewhere (and are ever changing) and
PageRank is interesting enough to deserve a paper of its own.
PageRank
• We begin by picturing the Web as a directed graph, with nodes
representing web pages and edges representing the links
between them.
• Suppose for instance, that we have a small Internet consisting of
just 4 web sites www.page1.com, www.page2.com,
www.page3.com, www.page4.com, referencing each other in the
manner suggested by the picture:
• We "translate" the picture into a directed graph with 4 nodes, one for each
web site.
• When web site i references j, we add a directed edge between node i and
node j in the graph.
• For the purpose of computing their page rank, we ignore any navigational
links such as back, next buttons, as we only care about the connections
between different web sites.
• For instance, Page1 links to all of the other pages, so node 1 in the graph
will have outgoing edges to all of the other nodes.
• Page3 has only one link, to Page 1, therefore node 3 will have one outgoing
edge to node 1.
• After analyzing each web page, we get the following graph:
• In our model, each page should transfer its importance evenly to the
pages that it links to.
• Node 1 has 3 outgoing edges, so it will pass on 1/3 of its importance to
each of the other 3 nodes.
• Node 3 has only one outgoing edge, so it will pass on all of its importance
to node 1.
• In general, if a node has k outgoing edges, it will pass on 1/k of its
importance to each of the nodes that it links to.
• Let us better visualize the process by assigning weights to each edge.
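Under these rules, the importance vector can be computed by repeatedly passing each page's rank along its outgoing edges. A sketch in Python follows; note that the outgoing links for pages 2 and 4 are assumed, since the text only specifies the links of Page1 and Page3.

```python
# PageRank by simple power iteration on the 4-page example.
# Links for pages 2 and 4 are assumed (the slides specify only 1 and 3).
links = {
    1: [2, 3, 4],   # Page1 links to all of the other pages
    2: [3, 4],
    3: [1],         # Page3 links only to Page1
    4: [1, 3],
}
pages = sorted(links)
rank = {p: 1.0 / len(pages) for p in pages}   # start with equal importance

for _ in range(50):
    new_rank = {p: 0.0 for p in pages}
    for p, outs in links.items():
        share = rank[p] / len(outs)   # a node with k edges passes on 1/k
        for q in outs:
            new_rank[q] += share      # of its importance to each target
    rank = new_rank

print(rank)
```

This is the basic model from the slides, without the damping factor used in the full PageRank formulation; with these links, Page1 ends up with the highest rank.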
Example
SIMILARITY
• A similarity measure is a function that computes the
degree of similarity between two vectors.
• Using a similarity measure between the query and
each document, it is possible to rank the retrieved
documents in order of presumed relevance.
• It is possible to enforce a certain threshold so that
the size of the retrieved set can be controlled.
• Document Similarity.
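The slides do not fix a specific measure; a standard choice in the vector space model is cosine similarity. The sketch below (with made-up term-weight vectors) shows both uses mentioned above: ranking by similarity and cutting off at a threshold.

```python
import math

# Cosine similarity between two term-weight vectors (a common choice;
# the query and document vectors here are hypothetical).
def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return dot / norm if norm else 0.0

query = [1, 1, 0]                               # term weights for the query
docs = [[2, 1, 0], [0, 1, 3], [1, 0, 0]]        # term weights per document

# Rank documents by similarity to the query (order of presumed relevance)
scores = sorted(enumerate(cosine_similarity(query, d) for d in docs),
                key=lambda s: s[1], reverse=True)
# Enforce a threshold to control the size of the retrieved set
retrieved = [i for i, s in scores if s >= 0.5]
print(scores, retrieved)
```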
What is Hadoop
• The Hadoop framework consists of two main layers:
– Distributed file system (HDFS)
– Execution engine (MapReduce)
Hadoop Master/Slave Architecture
• Hadoop is designed as a master-slave shared-nothing architecture
Who Uses MapReduce/Hadoop
• Google: inventors of the MapReduce computing
paradigm
• Yahoo: developed Hadoop, an open-source
implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others, plus universities and research labs
Hadoop: How it Works
• Distributed file system (HDFS)
• Execution engine (MapReduce)
Hadoop Distributed File System (HDFS)
Main Properties of HDFS
• Large: An HDFS instance may consist of thousands
of server machines, each storing part of the file
system's data
• Replication: Each data block is replicated many
times (the default is 3)
• Failure: Failure is the norm rather than the exception
• Fault Tolerance: Detection of faults and quick,
automatic recovery from them is a core
architectural goal of HDFS
– The Namenode constantly checks on the Datanodes
MapReduce Phases
• Deciding on what will be the key and what will be the value is the
developer's responsibility.
Properties of MapReduce Engine
• Job Tracker is the master node (runs with the namenode)
– Receives the user’s job
– Decides on how many tasks will run (number of mappers)
– Decides on where to run each mapper (concept of locality)
• Task Tracker is the slave node (runs on each datanode)
– Receives the task from Job Tracker
– Runs the task until completion (either map or reduce task)
– Always in communication with the Job Tracker reporting progress
Key-Value Pairs
• Mappers and Reducers are users’ code (provided functions)
• Just need to obey the Key-Value pairs interface
• Mappers:
– Consume <key, value> pairs
– Produce <key, value> pairs
• Reducers:
– Consume <key, <list of values>>
– Produce <key, value>
• Shuffling and Sorting:
– Hidden phase between mappers and reducers
– Groups all similar keys from all mappers, sorts and passes them to a
certain reducer in the form of <key, <list of values>>
Example 1: Word Count
• Job: Count the occurrences of each word in a data set
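The word-count job can be simulated in plain Python to show all three phases described above: map, shuffle-and-sort, and reduce. This is a sketch of the idea, not Hadoop code, and the input lines are made up.

```python
from itertools import groupby

# Mapper: consumes a line, produces <word, 1> pairs
def mapper(line):
    for word in line.split():
        yield (word, 1)

# Reducer: consumes <word, list of counts>, produces <word, total>
def reducer(word, counts):
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase
pairs = [kv for line in lines for kv in mapper(line)]
# Shuffle & sort: group all identical keys from all mappers together
pairs.sort(key=lambda kv: kv[0])
grouped = ((k, [v for _, v in g]) for k, g in groupby(pairs, key=lambda kv: kv[0]))
# Reduce phase
counts = dict(reducer(k, vs) for k, vs in grouped)
print(counts)
```

Here "the" appears three times and "fox" twice, so the reducer emits `("the", 3)` and `("fox", 2)`; in real Hadoop the shuffle step also routes each key group to a particular reducer node.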
Example 2: Color Count
• Job: Count the number of each color in a data set
Example 3: Color Filter
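A color filter differs from the two counting jobs in that it is a map-only job: the mapper emits only the records of the wanted color, and no reduce phase is needed. A sketch with a hypothetical data set:

```python
# Map-only color filter: keep only records matching the wanted color.
records = ["red", "blue", "green", "red", "blue", "red"]

def filter_mapper(record, wanted="red"):
    # Emit the record only if it matches; otherwise emit nothing
    if record == wanted:
        yield (record, 1)

# Map phase only -- the filtered pairs are the final output
filtered = [kv for r in records for kv in filter_mapper(r)]
print(filtered)
```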
