
UNIT-IV

WEB SEARCH – LINK ANALYSIS AND SPECIALIZED SEARCH


Link Analysis – hubs and authorities – PageRank and HITS algorithms – Searching and Ranking – Relevance
Scoring and Ranking for the Web – Similarity – Hadoop & MapReduce – Evaluation – Personalized Search –
Collaborative Filtering and Content-Based Recommendation of Documents and Products – Handling the "Invisible"
Web – Snippet Generation, Summarization, Question Answering, Cross-Lingual Retrieval.

4.1 LINK ANALYSIS

Links connecting pages are a key component of the Web.

Links are a powerful navigational aid for people browsing the Web, but they also help search
engines understand the relationships between the pages.

These detected relationships help search engines rank web pages more effectively. It should be
remembered, however, that many document collections used in search applications such as desktop
or enterprise search either do not have links or have very little link structure.

4.1.1 Anchor Text

Anchor text has two properties that make it particularly useful for ranking web pages. It
tends to be very short, perhaps two or three words, and those words often succinctly describe the
topic of the linked page. For instance, links to www.ebay.com are highly likely to contain the word
"eBay" in the anchor text.

Anchor text is usually written by people who are not the authors of the destination page.

For the query ibm, how do we distinguish between:

IBM's home page (mostly graphical, with little text)
IBM's copyright page (high term frequency for "ibm")
a rival's spam page (arbitrarily high term frequency)

Anchor text helps here, since most pages linking to www.ibm.com use the word "IBM".
Anchor text can sometimes have unexpected effects, e.g., spam, or the notorious
"miserable failure" Google bomb.


We can score anchor text with a weight that depends on the authority of the anchor page's
website. E.g., if we assume that content from cnn.com or yahoo.com is authoritative, then
we trust (more) the anchor text from them.
We can also increase the weight of off-site anchors (non-nepotistic scoring), since a site's
links to its own pages say little about their quality. A sketch of this kind of weighting follows.
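Below is a minimal Python sketch of authority-weighted, non-nepotistic anchor-text scoring. The authority weights, the 1.5 off-site boost, and the (source site, target site, anchor text) triple format are invented illustration values, not parameters from any real engine.

# Minimal sketch of weighted anchor-text scoring (illustrative only).
AUTHORITY_WEIGHT = {"cnn.com": 2.0, "yahoo.com": 2.0}  # assumed trusted sites

def anchor_score(anchors, query_term):
    """Score a target page by the anchor text of links pointing to it.

    anchors: list of (source_site, target_site, anchor_text) tuples.
    """
    score = 0.0
    for source, target, text in anchors:
        if query_term.lower() not in text.lower():
            continue
        w = AUTHORITY_WEIGHT.get(source, 1.0)   # trust authoritative sources more
        if source != target:                    # non-nepotistic: boost off-site anchors
            w *= 1.5
        score += w
    return score

anchors = [
    ("cnn.com", "ebay.com", "shop on eBay"),
    ("ebay.com", "ebay.com", "eBay home"),       # on-site (nepotistic) link
    ("example.org", "ebay.com", "eBay auctions"),
]
print(anchor_score(anchors, "ebay"))  # 2.0*1.5 + 1.0 + 1.0*1.5 = 5.5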

4.2 HUBS AND AUTHORITIES

PART-A
Define Authorities(Nov/Dec’16)

Good hub page for a topic points to many authoritative pages for that topic.

A good authority page for a topic is pointed to by many good hubs for that topic.

Circular definition - will turn this into an iterative computation.

Hubs and Authorities

High-level scheme

Extract from the web a base set of pages that could be good hubs or authorities.
From these, identify a small set of top hub and authority pages;
• iterative algorithm.
Base set

Given text query (say browser), use a text index to get all pages containing
browser.
→ Call this the root set of pages.
Add in any page that either
→ points to a page in the root set, or
→ is pointed to by a page in the root set.

Call this the base set.

Visualization

Get in-links (and out-links) from a connectivity server

Distilling hubs and authorities

Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
Initialize: for all x, h(x) ← 1; a(x) ← 1.
Iteratively update all h(x), a(x).
After the iterations converge:
Output the pages with the highest h() scores as the top hubs.
Output the pages with the highest a() scores as the top authorities.

Convergence: the iterative updates (with normalization) can be shown to converge, so the
hub and authority scores stabilize after a number of iterations.

4.3 PAGERANK AND HITS ALGORITHM

PART-B
Compare HITS and Page Rank in detail(Nov/Dec’16)
PART-B
Brief about HITS algorithm

Write short notes on topic-specific PageRank computation (Apr/May'17)

Page Rank is a method for rating the importance of web pages objectively and
mechanically using the link structure of the web.

• Page Rank is an algorithm used by Google Search to rank websites in their search
engine results. Page Rank was named after Larry Page, one of the founders of Google.
Page Rank is a way of measuring the importance of website pages. According to Google:

• Page Rank works by counting the number and quality of links to a page to determine a
rough estimate of how important the website is. The underlying assumption is that more
important websites are likely to receive more links from other websites.
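The following is a minimal power-iteration sketch of this idea in Python. The damping factor of 0.85 and the three-page graph are common illustrative choices, not Google's production values.

# Minimal PageRank power-iteration sketch (parameters are illustrative).
def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}            # start with a uniform distribution
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}   # teleportation term
        for p, outs in links.items():
            if not outs:                        # dangling page: spread rank evenly
                for q in pages:
                    new[q] += d * pr[p] / n
            else:
                share = d * pr[p] / len(outs)
                for q in outs:                  # each out-link passes an equal share
                    new[q] += share
        pr = new
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))  # C accumulates the most rank in this toy graph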

• Searching with Page Rank: two search engines:

a. Title-based search engine

b. Full-text search engine

a. Title-based search engine
It searches only the title: it finds all the web pages whose titles contain all the query
words, and sorts the results by PageRank.
Very simple and cheap to implement.
The title match ensures high precision, and PageRank ensures high quality.

b. Full-text search engine
Also called Google. It examines all the words in every stored document and also performs
PageRank.
More precise but more complicated.

Citation Analysis

Citation frequency
Bibliographic coupling frequency: articles that co-cite the same articles are related
Citation indexing: who is this author cited by? (Garfield, 1972)
PageRank preview: Pinski and Narin
Asked: which journals are authoritative?

Markov chains
A Markov chain consists of n states, plus an n × n transition probability matrix P.
At each step, we are in exactly one of the states.
For 1 ≤ i, j ≤ n, the matrix entry P_ij tells us the probability of j being the next state,
given that we are currently in state i.
Clearly, for all i, Σ_j P_ij = 1.
Markov chains are abstractions of random walks.
Exercise: represent a teleporting random walk over a small web graph as a Markov chain.

Ergodic Markov chains
For any ergodic Markov chain, there is a unique long-term visit rate for each state.
▪ Steady-state probability distribution.
▪ Over a long time-period, we visit each state in proportion to this rate.
▪ It doesn’t matter where we start.
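A small sketch of this convergence, assuming an invented 3-state ergodic transition matrix; repeated multiplication by P reaches the same steady-state distribution regardless of the start state.

# Steady-state distribution of a small ergodic Markov chain by power iteration.
import numpy as np

P = np.array([[0.1, 0.6, 0.3],   # P[i, j] = probability of moving i -> j
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])  # each row sums to 1

x = np.array([1.0, 0.0, 0.0])    # start in state 0; the start does not matter
for _ in range(100):
    x = x @ P                    # one step of the chain
print(x)                         # converges to the unique steady-state vector
print(x @ P)                     # multiplying again leaves it unchanged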

HITS ALGORITHMS( Hyperlink-Induced Topic Search )

• PageRank and HITS are two solutions to the same problem.

1. In the PageRank model, the value of a link from page S depends on the links into S.
2. In the HITS model, it depends on the value of the other links out of S.

• The algorithm performs a series of iterations, each consisting of two basic steps:

Authority Update: Update each node's Authority score to be equal to the sum of
the Hub Scores of each node that points to it. That is, a node is given a high authority
score by being linked from pages that are recognized as Hubs for information.

Hub Update: Update each node's Hub Score to be equal to the sum of the Authority
Scores of each node that it points to. That is, a node is given a high hub score by linking to nodes
that are considered to be authorities on the subject.

[Figure: hub pages Alice and Bob pointing to authority pages AT&T, ITIM, and O2]
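The following minimal sketch implements these two update steps on the toy graph in the figure above (Alice and Bob as hubs; AT&T, ITIM, O2 as authorities). The length normalization each round is one common choice for keeping scores bounded; it does not change the ranking.

# Minimal HITS sketch over a toy base set.
def hits(links, iters=20):
    """links: dict mapping each page to the pages it points to."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority update: sum of hub scores of pages pointing to it.
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs)
                for p in pages}
        # Hub update: sum of authority scores of pages it points to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

links = {"Alice": ["AT&T", "ITIM", "O2"], "Bob": ["ITIM", "O2"]}
hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True))  # top authorities first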

4.4 SEARCHING AND RANKING


Web Query Languages
• Web query languages require knowledge of the web site and the language syntax. They
are hard to use.
• A query is based on the content of each page. The power of the web resides in its capability
of redirecting the information flow via hyperlinks, so it appears natural that, in
order to evaluate the information content of a web object, the web structure has to be
carefully analyzed.
• Recent experiments seem to confirm that hyperlinks can be very valuable in locating or
organizing information. They have been used:
To improve an initial ranking of documents.
To compute an estimate of a web pages popularity.
To find the most important hubs and authorities for a given topic.
Web Agents

• Web agents are complex software systems that operate in the World Wide Web,
the Internet, and related corporate, government, or military intranets. They are designed to
perform a variety of tasks, from caching and routing to searching, categorizing, and filtering.

• A web agent reads the request, talks to the server, and sends the results back to
the user's web browser. A web agent can, for instance, request a login web page, enter
appropriate login parameters, post the login request and, when done, return the resulting
web page to the caller.

• An agent might move from one system to another to access remote resources and/or
meet other agents. Web agents perform a variety of tasks such as routing, searching,
categorizing, and caching.

4.5 RELEVANCE SCORING AND RANKING FOR WEB

Ranking the documents on the basis of their estimated relevance to the query is critical.
Relevance ranking is based on factors such as:

Term frequency
The frequency of occurrence of the query keywords in a document.

Inverse document frequency
How many documents the query keyword occurs in; the fewer, the more importance is
given to the documents that do contain it.

Relevance Ranking using Terms

TF-IDF

A term occurring frequently in the document but rarely in the rest of the collection
is given high weight.
Many other ways of determining term weights have been proposed; experimentally,
tf-idf has been found to work well.

w_ij = tf_ij × idf_i = tf_ij × log2(N / df_i)

Given a document containing terms with the following frequencies:
A(3), B(2), C(1)
Assume the collection contains 10,000 documents, and the document frequencies of these
terms are:
A(50), B(1300), C(250)

Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8

⚫ The query vector is typically treated as a document and also tf-idf weighted.
⚫ An alternative is for the user to supply weights for the given query terms.
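A short sketch that reproduces the worked example above; tf is normalized by the maximum term count in the document, which is the convention the example uses.

# Sketch reproducing the tf-idf example above (N = 10,000 documents).
import math

N = 10_000
doc_tf = {"A": 3, "B": 2, "C": 1}          # raw term counts in the document
df = {"A": 50, "B": 1300, "C": 250}        # collection document frequencies

max_tf = max(doc_tf.values())
for term, tf in doc_tf.items():
    ntf = tf / max_tf                      # tf normalized by the max count
    idf = math.log2(N / df[term])
    print(term, round(ntf * idf, 1))       # A 7.6, B 2.0, C 1.8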

Relevance using Hyperlinks

When using keyword queries on the web, the number of documents is enormous.

Using term frequencies makes “spamming” easy.


E.g., a travel agency may add many spurious occurrences of a popular query word to its pages.

Most people are looking for pages from popular sites.

Refinement

When computing prestige based on links to a site, give more weight to links from
sites that themselves have higher prestige.

This connects to social-networking theories that rank the prestige of people.

E.g., the President of the US.

Hub and Authority Based Ranking

A hub is a page that links to many pages (on a topic).

An authority is a page that contains actual information on a topic.

Each page gets a hub prestige based on the prestige of the authorities that it points to.

Each page gets an authority prestige based on the prestige of the hubs that point to it.

Again, the prestige definitions are cyclic, and the scores can be obtained by solving linear equations.

Use authority prestige when ranking answers to a query.

4.6 SIMILARITY

A similarity measure is a function that computes the degree of similarity between two
vectors.
Using a similarity measure between the query and each document:
It is possible to rank the retrieved documents in the order of presumed relevance.
It is possible to enforce a certain threshold so that the size of the retrieved set can be
controlled.
Similarity between vectors for the document d_j and query q can be computed as the
vector inner product (a.k.a. dot product):

sim(d_j, q) = d_j · q = Σ_i (w_ij × w_iq)

where w_ij is the weight of term i in document j and w_iq is the weight of term i in the
query.

For binary vectors, the inner product is the number of matched query terms in the
document (size of intersection).
For weighted term vectors, it is the sum of the products of the weights of the
matched terms.
The inner product is unbounded.
Favors long documents with a large number of unique terms.
Measures how many terms matched but not how many terms are not matched.
Weighted example:

D1 = 2T1 + 3T2 + 5T3;  D2 = 3T1 + 7T2 + 1T3;  Q = 0T1 + 0T2 + 2T3

sim(D1, Q) = 2×0 + 3×0 + 5×2 = 10

sim(D2, Q) = 3×0 + 7×0 + 1×2 = 2

Cosine similarity measures the cosine of the angle between two vectors: the inner
product normalized by the vector lengths.

CosSim(d_j, q) = (d_j · q) / (|d_j| × |q|)
               = Σ_{i=1..t} (w_ij × w_iq) / ( sqrt(Σ_{i=1..t} w_ij²) × sqrt(Σ_{i=1..t} w_iq²) )

D1 = 2T1 + 3T2 + 5T3;  CosSim(D1, Q) = 10 / sqrt((4+9+25) × (0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3;  CosSim(D2, Q) = 2 / sqrt((9+49+1) × (0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

D1 is about 6 times better than D2 using cosine similarity, but only 5 times better using
the inner product.
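A short sketch reproducing both computations for D1, D2, and Q above.

# Sketch of the inner-product and cosine computations shown above.
import math

D1 = [2, 3, 5]          # weights of T1, T2, T3 in each vector
D2 = [3, 7, 1]
Q  = [0, 0, 2]

def inner(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

def cosine(d, q):
    return inner(d, q) / (math.sqrt(inner(d, d)) * math.sqrt(inner(q, q)))

print(inner(D1, Q), inner(D2, Q))                        # 10 2
print(round(cosine(D1, Q), 2), round(cosine(D2, Q), 2))  # 0.81 0.13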

4.7 HADOOP AND MAP REDUCE

Hadoop provides a reliable shared storage and analysis system for large-scale data
processing.

Storage is provided by HDFS (the Hadoop Distributed File System).

Analysis is provided by MapReduce (a distributed data-processing model).

HDFS Architecture

[Figure: HDFS architecture, with a single name node managing metadata and multiple
data nodes storing file blocks]
Name Node:

Stores all metadata: File name, locations of each block on data nodes, file
attributes etc…

Data Node

Stores file contents as blocks

Different blocks of the same file are stored on different data nodes.

Data nodes exchange heartbeats with name node.

If no heartbeat is received within a certain time period, the data node is assumed to be lost.

Losing the name node is equivalent to losing all the files on the file system.

Hadoop provides two options:

Back up the files that make up the persistent state of the file system.

Run a secondary name node.

Map Reduce

MapReduce is a method for distributing a task across multiple nodes.

Each node processes the data stored on that node.

It consists of two phases: 1. Map, 2. Reduce.

MapReduce Process: [Figures illustrating the map and reduce phases are not reproduced here.]
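Below is a word-count sketch of the programming model in plain Python. Real Hadoop distributes the map and reduce calls across nodes and handles the shuffle itself; this only imitates the map → shuffle → reduce data flow in one process.

# Sketch of the MapReduce programming model as a single-process word count.
from collections import defaultdict

def map_phase(document):
    for word in document.split():
        yield (word.lower(), 1)            # emit (key, value) pairs

def reduce_phase(word, counts):
    yield (word, sum(counts))              # combine all values for one key

documents = ["Hadoop stores data", "MapReduce processes data"]

# Shuffle: group every emitted value by its key.
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

results = [pair for key, values in grouped.items()
           for pair in reduce_phase(key, values)]
print(results)  # [('hadoop', 1), ('stores', 1), ('data', 2), ...]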

4.8 EVALUATION

TREC Collection

TREC is a workshop series that provides the infrastructure for large-scale testing of
retrieval technology.

The Text REtrieval Conference, co-sponsored by the National Institute of Standards and
Technology and the U.S. Department of Defense, was started in 1992 as part of the
TIPSTER Text program.

TREC workshop series has the following goals:

• To encourage research in information retrieval based on large test collections.

• To increase communication among industry, academia, and government by
creating an open forum for the exchange of research ideas

• To speed the transfer of technology from research labs into commercial products
by demonstrating substantial improvements in retrieval methodologies on real-
world problems

• To increase the availability of appropriate evaluation techniques

The set of tracks in a particular TREC depends on:

Interest of participants
Appropriateness of the task to TREC
Needs of sponsors
Resource constraints

Evaluation measures at the TREC conference

• Summary table statistics – single-value measures can be stored in a table
to provide a statistical summary regarding the set of all the queries in a
retrieval task.

• Recall-precision average – a table or graph with average precision at 11
standard recall levels.

• Document level average – average precision computed at specified
document cutoff values.

• Average precision histogram – a graph with a single measure for each
separate topic.

The CACM and ISI Collection

These are small collections of computer science literature. The CACM test collection
contains the text of 3,204 documents, consisting of articles published in
Communications of the ACM.

CACM collection also includes information on structured subfields as follows:

• Word stems from the title and abstract sections.


• Categories
• Direct references between articles
• Bibliographic coupling connections.
• Number of co-citations for each pair of articles.
• Author names.
• Date information.

4.9 PERSONALIZED SEARCH

PART-B

Brief about Personalized search.(Nov/Dec’17)

In order to personalize search, we need to combine at least two different computational
techniques: contextualization and individualization.

Contextualization – "the interrelated conditions that occur within an activity... includes
factors like the nature of information available, the information currently being
examined, and the applications in use".

Individualization – "the totality of characteristics that distinguishes an individual... uses
the user's goals, prior and tacit knowledge, past information-seeking behaviors".

The main ways to personalize a search are "query augmentation" and "result processing".

Query augmentation – when a user enters a query, the query can be compared against
the contextual information available to determine whether the query can be refined to
include other terms.

Query augmentation can also be done by computing the similarity between the query
terms and the user model: if the query is on a topic the user has previously seen, the
system can reinforce the query with similar terms.

This refined query is then shown to the user and "submitted to a search engine for
processing".
processing”

• Once the query has been augmented and processed by the search engine, the results can
be “individualized”

• The results being individualized - this means that the information is filtered based upon
information in the user’s model and/or context

• The user model “can re-rank search results based upon the similarity of the content of the
pages in the results and the user’s profile”

• Another processing method is to re-rank the results based upon the "frequency, recency,
or duration of usage... providing users with the ability to identify the most popular, faddish
and time-consuming pages they've seen"

• "Have Seen, Have Not Seen" – this feature allows new information to be identified
and lets the user return to information already seen.

4.10 COLLABORATIVE FILTERING

PART-A

Define user based collaborative filtering(Nov/Dec’16)

PART-B

Explain in detail the collaborative filtering using clustering technique(Nov/Dec’17)

Explain in detail collaborative filtering and content-based recommendation systems with an
example (Apr/May'17)

[Figure: a sparse user-item ratings matrix; each of several users has rated only some of
the items A-Z, e.g. one user rates A=9, B=3, Z=5, another rates C=9, Z=10]

Weight all users with respect to similarity with the active user.

Select a subset of the users (neighbors) to use as predictors.

Normalize ratings and compute a prediction from a weighted combination of the selected
neighbors’ ratings.

Present items with highest predicted ratings as recommendations.

Neighbor Selection

For a given active user, a, select correlated users to serve as source of predictions.

The standard approach is to use the most similar n users, u, based on the similarity weights w_{a,u}.

Alternate approach is to include all users whose similarity weight is above a given
threshold.

Rating Prediction

Predict a rating, p_{a,i}, for each item i for the active user a, using the n selected neighbor
users u ∈ {1, 2, ..., n}.

To account for users' different rating levels, base predictions on differences from each
user's average rating.

Weight each user's rating contribution by their similarity to the active user:

p_{a,i} = r̄_a + ( Σ_{u=1..n} w_{a,u} × (r_{u,i} − r̄_u) ) / ( Σ_{u=1..n} |w_{a,u}| )
Similarity Weighting

• Typically, use the Pearson correlation coefficient between the ratings of the active
user a and another user u:

c_{a,u} = covar(r_a, r_u) / (σ_{r_a} × σ_{r_u})

where r_a and r_u are the rating vectors for the m items rated by both a and u, and
r_{i,j} is user i's rating for item j.
Covariance and Standard Deviation

• Mean rating: r̄_x = ( Σ_{i=1..m} r_{x,i} ) / m

• Covariance: covar(r_a, r_u) = ( Σ_{i=1..m} (r_{a,i} − r̄_a) × (r_{u,i} − r̄_u) ) / m

• Standard deviation: σ_{r_x} = sqrt( ( Σ_{i=1..m} (r_{x,i} − r̄_x)² ) / m )

Significance Weighting

It is important not to trust correlations based on very few co-rated items. Include a
significance weight, s_{a,u}, based on the number of co-rated items, m:

w_{a,u} = s_{a,u} × c_{a,u}

s_{a,u} = 1 if m > 50;  s_{a,u} = m / 50 if m ≤ 50

Problems with Collaborative Filtering

Cold Start: There needs to be enough other users already in the system to find a
match.

Sparsity: If there are many items to be recommended, even if there are many
users, the user/ratings matrix is sparse, and it is hard to find users that have rated
the same items.

First Rater: Cannot recommend an item that has not been previously rated.

– New items

– Esoteric items

Popularity Bias: Cannot recommend items to someone with unique tastes.

Content-Based Recommending

• Recommendations are based on information about the content of items rather than on
other users' opinions.

• Uses machine learning algorithms to induce a profile of the user's preferences from
examples, based on a featural description of the content.

• Many such systems exist; a small sketch follows.
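A minimal sketch of content-based recommending, assuming invented genre features and a single liked item: build a profile by summing the feature vectors of liked items, then rank unseen items by cosine similarity to the profile.

# Sketch of content-based recommending over invented genre features.
import math

features = {                      # item -> feature (e.g. genre) vector
    "Movie1": {"action": 1, "scifi": 1},
    "Movie2": {"romance": 1, "drama": 1},
    "Movie3": {"action": 1, "thriller": 1},
}
liked = ["Movie1"]                # items the user rated highly

profile = {}
for item in liked:                # profile = sum of liked items' features
    for f, v in features[item].items():
        profile[f] = profile.get(f, 0) + v

def cosine(a, b):
    dot = sum(a.get(f, 0) * b.get(f, 0) for f in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

candidates = [i for i in features if i not in liked]
ranked = sorted(candidates, key=lambda i: cosine(profile, features[i]),
                reverse=True)
print(ranked)  # Movie3 ranks above Movie2 (shares the "action" feature)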

Advantages of Content-Based Approach

• No need for data on other users.

– No cold-start or sparsity problems.

• Able to recommend to users with unique tastes.

• Able to recommend new and unpopular items.

– No first-rater problem.

• Can provide explanations of recommended items by listing the content features that
caused an item to be recommended.

• Well-known technology: the entire field of classification learning is at (y)our disposal!

Disadvantages of Content-Based Method

• Requires content that can be encoded as meaningful features.

• Users’ tastes must be represented as a learnable function of these content features.

• Unable to exploit quality judgments of other users.

– Unless these are somehow included in the content features.

Combining Content and Collaboration

• Content-based and collaborative methods have complementary strengths and
weaknesses.

• Combine methods to obtain the best of both.

• Various hybrid approaches:

– Apply both methods and combine recommendations.

– Use collaborative data as content.

– Use content-based predictor as another collaborator.

– Use content-based predictor to complete collaborative data.

E.g., the movie domain:

• Crawled the Internet Movie Database (IMDb)

– Extracted content for titles in EachMovie.

• Basic movie information: title, director, cast, genre, etc.

• Popular opinions: user comments, newspaper and newsgroup reviews, etc.

Content-Boosted Collaborative Filtering

[Figure: a content-based predictor fills in the missing entries of the sparse ratings
matrix, and collaborative filtering is then run on the densified matrix]

4.11 HANDLING INVISIBLE WEB

Web sites that are hidden from, or cannot be found or catalogued by, regular search
engines:

200,000+ Web sites
550 billion individual documents, compared to the three billion of the surface Web
Contains 7,500 terabytes of information, compared to nineteen terabytes in the surface
Web
Total quality content is 1,000 to 2,000 times greater than that of the surface Web
Sixty of the largest sites collectively contain over 750 terabytes of information,
exceeding the size of the surface Web forty times over
Fastest growing category of new information on the Internet
Fifty percent greater monthly traffic than surface sites
More highly linked to than surface sites
Narrower, with deeper content, than conventional surface sites
More than half of the content resides in topic-specific databases
Content is highly relevant to every information need, market, and domain
Not well known to the Internet-searching public

Searching is usually carried out using a "directory" or "search engine": fast and
efficient, but it misses most of what is out there.
70% of searchers start from three sites (Nielsen, 2003): Google, Yahoo, and MSN.
Searching Tools
Directories
Search engines
1. Searchable databases:
Typing is required.
Pages are not available until asked for (e.g., Library of Congress).
Pages are not static but dynamic (they may not exist until requested).
Search engines can't handle "dynamic pages."
Search engines can't handle "input boxes."

2. Password or login required:

Spiders do not know passwords or login IDs.

3. Non-HTML pages:
PDF, Word, Shockwave, Flash...

4. Script-based (computer-generated) pages:

– Create all or part of a Web page
– Contain "?" in the URL

4.12 SNIPPET GENERATION, SUMMARIZATION,QUESTION ANSWERING

PART-A

What is snippet generation?(Nov/Dec’16)

4.12.1 Snippet Generation

A snippet is a short summary of the document, designed to allow the user to decide
its relevance. A snippet is a query-dependent summary.

A snippet consists of the document title and a short summary, which is automatically
extracted.

Snippet generation steps:

1. Rank each sentence in the document using a significance factor.

2. Select the top sentences for the summary.

Sentence selection:

The significance factor for a sentence is calculated based on the occurrences of
significant words. If f_{d,w} is the frequency of word w in document d, then w is a
significant word if it is not a stopword and

f_{d,w} ≥ 7 − 0.1 × (25 − s_d)   if s_d < 25
f_{d,w} ≥ 7                      if 25 ≤ s_d ≤ 40
f_{d,w} ≥ 7 + 0.1 × (s_d − 40)   otherwise

where s_d is the number of sentences in document d.

w w w w w w w w w w      (initial sentence)

w w s w s s w w s w      (significant words identified, marked s)

w w [s w s s w w s] w    (text span bracketed by significant words)

Significance factor = 4² / 7 ≈ 2.3   (4 significant words in a bracketed span of 7 words)
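A sketch of this selection procedure, assuming a tiny invented stopword list; significant_words applies the threshold above, and significance_factor scores the span bracketed by significant words.

# Sketch of Luhn-style snippet scoring using the formulas above.
STOPWORDS = {"the", "a", "of", "is", "in", "and", "to", "for", "on"}

def significant_words(sentences):
    words = [w.lower() for s in sentences for w in s.split()]
    s_d = len(sentences)                     # number of sentences in the document
    if s_d < 25:
        threshold = 7 - 0.1 * (25 - s_d)
    elif s_d <= 40:
        threshold = 7
    else:
        threshold = 7 + 0.1 * (s_d - 40)
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    return {w for w, f in freq.items()
            if w not in STOPWORDS and f >= threshold}

def significance_factor(sentence, sig):
    tokens = [w.lower() for w in sentence.split()]
    positions = [i for i, w in enumerate(tokens) if w in sig]
    if not positions:
        return 0.0
    span = positions[-1] - positions[0] + 1  # bracketed text span
    return len(positions) ** 2 / span        # (significant words)^2 / span length

sentences = ["search engines rank search results for search users doing search on search"]
sig = significant_words(sentences)
print(sig, round(significance_factor(sentences[0], sig), 2))  # {'search'} 2.08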

Key Term Extraction

The key term extraction module has three sub-modules: query term extraction, title
words extraction, and meta keywords extraction.

The query term extraction module receives the parsed and translated query, and extracts
all the query terms together with their Boolean relations (AND/NOT).

Sentence Extraction

This takes the parsed text of the documents as input, filters it, and extracts all the
sentences from the parsed text. It has two modules: text filterization and sentence
extraction.

4.12.2 SUMMARIZATION

A summary is a text that is produced from one or more texts, conveys a significant
portion of the information in the original text(s), and is no longer than half of the
original text.

Genres of summaries

▪ Indicative vs. informative


...used for quick categorization vs. content processing.
▪ Extract vs. abstract
...lists fragments of text vs. re-phrases content coherently.
▪ Generic vs. query-oriented
...provides author’s view vs. reflects user’s interest.
▪ Background vs. just-the-news
...assumes reader’s prior knowledge is poor vs. up-to-date.
▪ Single-document vs. multi-document source
...based on one text vs. fuses together many texts.
Summarization Machine

[Figures showing the summarization machine and its modules are not reproduced here.]
4.12.3 QUESTION ANSWERING

PART-B

Explain in detail about Community based Question Answering system.(Nov/Dec’17)

The main aim of QA is to present the user with a short answer to a question rather than a
list of possibly relevant documents.

As it becomes more and more difficult to find answers on the WWW using standard
search engines, question answering technology will become increasingly important.

4.13 CROSS-LINGUAL RETRIEVAL

PART-B

Explain in detail about the working of Naïve Bayesian classifier with an example.(Nov/Dec’16)

Cross-lingual retrieval refers to the retrieval of documents that are in a language
different from the one in which the query is expressed.

This allows users to search document collections in multiple languages and retrieve
relevant information in a form that is useful to them, even when they have little or no
linguistic competence in the target languages.

Cross-lingual information retrieval is important for countries like India, where a very
large fraction of the people are not conversant with English and thus do not have access
to the vast store of information on the web.

Two methods are used to solve this problem (a small illustrative sketch follows):

• Query translation:
• Translate the English query into a Chinese query.
• Search the Chinese document collection.
• Translate the retrieved results back into English.
• Query translation is easy, but the translation of the retrieved documents must be
performed at query time.

• Document translation:
• Translate the entire document collection into English.
• Search the collection in English.
• Documents can be translated and stored offline, but automatic translation can be slow.
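A toy sketch contrasting the two pipelines. The word-for-word dictionary and the token-overlap search are stand-ins for real machine translation and a real retrieval engine.

# Sketch of query translation vs. document translation (all stubs invented).
DICTIONARY = {"computer": "ordinateur", "science": "science"}  # en -> fr stub

def translate_query(query):
    return " ".join(DICTIONARY.get(w, w) for w in query.lower().split())

def search(query, collection):
    terms = set(query.split())
    return [d for d in collection if terms & set(d.lower().split())]

french_docs = ["introduction a la science", "recherche sur ordinateur"]

# Query translation: translate the query at query time, search the
# collection in its own language.
print(search(translate_query("computer science"), french_docs))

# Document translation would instead translate french_docs into English
# offline, then search the translated collection with the English query.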

****************************************************************

UNIT-V

DOCUMENT TEXT MINING

Information filtering; organization and relevance feedback – Text Mining – Text
classification and clustering – Categorization algorithms: naive Bayes, decision trees, and
nearest neighbor – Clustering algorithms: agglomerative clustering, k-means, expectation
maximization (EM).

5.1 INFORMATION FILTERING; ORGANIZATION AND RELEVANCE FEEDBACK

PART-A

Differentiate between information filtering and information retrieval(Nov/Dec’17)

What are the characteristics of information filtering? (Nov/Dec'16)
