Links are a powerful navigational aid for people browsing the Web, but they also help search
engines understand the relationships between pages.
These detected relationships help search engines rank web pages more effectively. It should be
remembered, however, that many document collections used in search applications such as desktop
or enterprise search either do not have links or have very little link structure.
Anchor text has two properties that make it particularly useful for ranking web pages. It
tends to be very short, perhaps two or three words, and those words often succinctly describe the
topic of the linked page. For instance, links to www.ebay.com are highly likely to contain the word
“eBay” in the anchor text.
Anchor text is usually written by people who are not the authors of the destination page.
PART-A
Define Authorities (Nov/Dec’16)
A good hub page for a topic points to many authoritative pages for that topic.
A good authority page for a topic is pointed to by many good hubs for that topic.
High-level scheme
Extract from the web a base set of pages that could be good hubs or authorities.
From these, identify a small set of top hub and authority pages using an iterative algorithm.
Base set
Given a text query (say, “browser”), use a text index to get all pages containing
“browser”.
→ Call this the root set of pages.
Add in any page that either
→ points to a page in the root set, or
→ is pointed to by a page in the root set.
Visualization (figure omitted: the root set expanded into the base set)
Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
Initialize: for all x, h(x) ← 1; a(x) ← 1;
Iteratively update all h(x), a(x);
After the iterations converge:
Output pages with the highest h() scores as top hubs.
Output pages with the highest a() scores as top authorities.
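A minimal sketch of this iterative scheme in Python (the link graph below is a made-up base set, and a fixed number of iterations stands in for a convergence test):

```python
import math

def hits(graph, iterations=50):
    # graph: page -> list of pages it links to (the base set).
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}     # h(x) <- 1
    auth = {p: 1.0 for p in pages}    # a(x) <- 1
    for _ in range(iterations):
        # Authority update: a(x) = sum of h(q) over pages q linking to x.
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ())) for p in pages}
        # Hub update: h(x) = sum of a(q) over pages q that x links to.
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

hub, auth = hits({"x1": ["y1", "y2"], "x2": ["y1"], "x3": ["y1", "y2"]})
print(max(auth, key=auth.get), max(hub, key=hub.get))  # top authority, top hub
```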
Convergence
PART-B
Compare HITS and Page Rank in detail (Nov/Dec’16)
PART-B
Brief about HITS algorithm
Page Rank is a method for rating the importance of web pages objectively and
mechanically using the link structure of the web.
• Page Rank is an algorithm used by Google Search to rank websites in their search
engine results. Page Rank was named after Larry Page, one of the founders of Google.
Page Rank is a way of measuring the importance of website pages. According to Google:
• Page Rank works by counting the number and quality of links to a page to determine a
rough estimate of how important the website is. The underlying assumption is that more
important websites are likely to receive more links from other websites.
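A minimal sketch of this counting process (the damping factor d = 0.85 is the conventional choice, not something stated in these notes, and the three-page graph is hypothetical):

```python
def pagerank(links, d=0.85, iterations=50):
    # links: page -> list of pages it links to.
    pages = set(links) | {q for out in links.values() for q in out}
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}            # start from a uniform rank
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}   # random-jump share
        for p in pages:
            out = links.get(p, [])
            if out:                              # a link passes on rank / #out-links
                for q in out:
                    new[q] += d * pr[p] / len(out)
            else:                                # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += d * pr[p] / n
        pr = new
    return pr

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))  # C ranks highest
```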
b. Full-text search engine (e.g., Google): it examines all the words in every stored
document and also performs Page Rank. More precise but more complicated.
Citation Analysis
Citation frequency
Bibliographic coupling frequency: articles that cite the same articles are related.
Citation indexing
Who is this author cited by? (Garfield 1972)
Pagerank preview: Pinski and Narin, ’60s
Asked: which journals are authoritative?
Markov chains
A Markov chain consists of n states, plus an n×n transition probability matrix P.
At each step, we are in one of the states.
For 1 ≤ i, j ≤ n, the matrix entry Pij tells us the probability of j being the next state, given
we are currently in state i.
Clearly, for all i, Σj Pij = 1.
Markov chains are abstractions of random walks.
Exercise: represent the teleporting random walk described earlier as a Markov chain.
Ergodic Markov chains
For any ergodic Markov chain, there is a unique long-term visit rate for each state.
▪ Steady-state probability distribution.
▪ Over a long time-period, we visit each state in proportion to this rate.
▪ It doesn’t matter where we start.
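A small numeric illustration of these claims (the 3-state matrix is made up; any row-stochastic, ergodic P behaves the same way):

```python
import numpy as np

# Transition matrix of a hypothetical ergodic 3-state chain; rows sum to 1.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])

x = np.array([1.0, 0.0, 0.0])   # start anywhere: ergodicity makes it irrelevant
for _ in range(100):
    x = x @ P                    # take one more step of the walk
print(x)                         # long-term visit rates; satisfies x ≈ x @ P
```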
• Page Rank and HITS are two solutions to the same problem.
1. In the Page Rank model, the value of a link depends on the links into S.
2. In the HITS model, it depends on the value of the other links out of S.
• The algorithm performs a series of iterations, each consisting of two basic steps:
Authority Update: Update each node's Authority score to be equal to the sum of
the Hub Scores of each node that points to it. That is, a node is given a high authority
score by being linked from pages that are recognized as Hubs for information.
Hub Update: Update each node's Hub Score to be equal to the sum of the Authority
Scores of each node that it points to. That is, a node is given a high hub score by linking to nodes
that are considered to be authorities on the subject.
(Figure omitted: example link graph over the pages AT&T, Alice, ITIM, Bob and O2)
• Web agents are complex software systems that operate in the World Wide Web,
the internet and related corporate, government or military intranets. They are designed to
perform a variety of tasks, from caching and routing to searching, categorizing and filtering.
• A web agent reads the request, talks to the server and sends the results back to
the user’s web browser. A web agent can, for instance, request a login web page, enter
appropriate login parameters, post the login request and, when done, return the resulting
web page to the caller.
• An agent might move from one system to another to access remote resources and/or
meet other agents. Web agents perform a variety of tasks like routing, searching,
categorizing and caching.
Ranking of the documents on the basis of estimated relevance to the query is critical.
Relevance ranking is based on factors such as:
Term frequency: the frequency of occurrence of query keywords in a document.
TF-IDF
A term occurring frequently in the document but rarely in the rest of the collection
is given high weight.
Many other ways of determining term weights have been proposed.
Experimentally, tf-idf has been found to work well.
wij = tfij · idfi = tfij · log2(N / dfi)
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and document frequencies of these terms
are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3 = 1.0; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
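A short sketch that reproduces these numbers (tf here is normalized by the most frequent term in the document, as the worked example does):

```python
import math

N = 10000                               # documents in the collection
tf = {"A": 3, "B": 2, "C": 1}           # term counts in this document
df = {"A": 50, "B": 1300, "C": 250}     # document frequencies in the collection

max_tf = max(tf.values())
for term in tf:
    # tf-idf weight: normalized tf times log2(N / df)
    weight = (tf[term] / max_tf) * math.log2(N / df[term])
    print(term, round(weight, 1))       # A 7.6, B 2.0, C 1.8
```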
⚫ Query vector is typically treated as a document and also tf-idf weighted.
⚫ Alternative is for the user to supply weights for the given query terms.
When using keyword queries on the web, the number of documents is enormous.
Most of the people are looking for pages from popular sites.
Refinement
When computing prestige based on links to a site, give more weightage to links from
sites that themselves have higher prestige.
Hub and authority based ranking
Each page gets a hub prestige based on the prestige of the authorities that it points to.
Each page gets an authority prestige based on the prestige of the hubs that point to it.
Again, these prestige definitions are cyclic and can be obtained by solving linear equations.
4.6 SIMILARITY
A similarity measure is a function that computes the degree of similarity between two
vectors.
Using a similarity measure between the query and each document:
It is possible to rank the retrieved documents in the order of presumed relevance.
It is possible to enforce a certain threshold so that the size of the retrieved set can be
controlled.
Similarity between vectors for the document dj and query q can be computed as the
vector inner product (a.k.a. dot product):
sim(dj, q) = dj • q = Σi=1..t wij · wiq
where wij is the weight of term i in document j and wiq is the weight of term i in the
query.
For binary vectors, the inner product is the number of matched query terms in the
document (size of intersection).
For weighted term vectors, it is the sum of the products of the weights of the
matched terms.
The inner product is unbounded.
Favors long documents with a large number of unique terms.
Measures how many terms matched but not how many terms are not matched.
Weighted example:
D1 = 2T1 + 3T2 + 5T3;  D2 = 3T1 + 7T2 + 1T3;  Q = 0T1 + 0T2 + 2T3
sim(D1, Q) = 5·2 = 10;  sim(D2, Q) = 1·2 = 2
Cosine similarity measures the cosine of the angle between two vectors: the inner
product normalized by the vector lengths.
CosSim(dj, q) = (dj • q) / (|dj| · |q|) = Σi=1..t (wij · wiq) / ( sqrt(Σi=1..t wij²) · sqrt(Σi=1..t wiq²) )
CosSim(D1, Q) = 10 / (√38 · 2) ≈ 0.81;  CosSim(D2, Q) = 2 / (√59 · 2) ≈ 0.13
D1 is 6 times better than D2 using cosine similarity but only 5 times better using the
inner product.
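A sketch checking both scores on this example:

```python
import math

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]   # weights for T1, T2, T3

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos_sim(a, b):
    # Inner product normalized by the two vector lengths.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

print(dot(D1, Q), dot(D2, Q))                               # 10 2  (ratio 5)
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))   # 0.81 0.13 (ratio ~6)
```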
Hadoop provides a reliable shared storage and analysis system for large scale data
processing.
HDFS Architecture
Name Node:
Stores all metadata: file names, locations of each block on the data nodes, file
attributes, etc.
Data Node:
Different blocks of the same file are stored on different data nodes.
If no heartbeat is received within a certain time period, the data node is assumed to be lost.
Backups are kept of the files that make up the persistent state of the file system.
Map Reduce
Reduce Process:
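The notes show only headings for the map and reduce phases, so here is a minimal word-count sketch of the MapReduce programming model in plain Python (an illustration of the model, not the Hadoop Java API): map emits key-value pairs, the framework groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: aggregate all values emitted for one key.
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog"]
groups = defaultdict(list)
for line in lines:                       # map phase
    for key, value in map_fn(line):
        groups[key].append(value)        # shuffle: group values by key
print([reduce_fn(w, c) for w, c in sorted(groups.items())])  # reduce phase
```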
4.8 EVALUATION
TREC Collection
TREC is a workshop series that provides the infrastructure for large-scale testing of
retrieval technology.
The Text REtrieval Conference (TREC), co-sponsored by the National Institute of
Standards and Technology (NIST) and the U.S. Department of Defense, was started in
1992 as part of the TIPSTER Text program.
• To increase communication among industry, academia and government by
creating an open forum for the exchange of research ideas
• To speed the transfer of technology from research labs into commercial products
by demonstrating substantial improvements in retrieval methodologies on real-
world problems
Interest of participants
Appropriateness of task of TREC
Need of sponsors
Resource constraints
• Summary table statistics – single-value measures can also be stored in a table
to provide a statistical summary over the set of all queries in a retrieval task.
CACM is a small collection about computer science literature, containing the text of
3,204 documents. The documents in the CACM test collection consist of all articles
published in the Communications of the ACM.
PART-B
The main ways to personalize a search are “query augmentation” and “result processing”.
Query augmentation: when a user enters a query, the query can be compared against
the contextual information available to determine if the query can be refined to include
other terms.
Query augmentation can also be done by computing the similarity between the query
terms and the user model: if the query is on a topic the user has previously seen, the
system can reinforce the query with similar terms.
This refined query is then shown to the user and “submitted to a search engine for
processing”.
• Once the query has been augmented and processed by the search engine, the results can
be “individualized”
• The results being individualized - this means that the information is filtered based upon
information in the user’s model and/or context
• The user model “can re-rank search results based upon the similarity of the content of the
pages in the results and the user’s profile”
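A sketch of such profile-based re-ranking, assuming the profile and each result page have already been reduced to term-weight vectors (all page names and weights below are hypothetical):

```python
import math

def cosine(u, v):
    # Similarity between two sparse term-weight vectors.
    num = sum(u[t] * v[t] for t in set(u) & set(v))
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

profile = {"python": 0.8, "tutorial": 0.5}           # user model
results = {
    "page1": {"python": 0.9, "snake": 0.4},
    "page2": {"monty": 0.7, "python": 0.2},
}
reranked = sorted(results, key=lambda p: cosine(profile, results[p]), reverse=True)
print(reranked)   # pages most similar to the profile come first
```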
• Another processing method is to re-rank the results based upon the “frequency, recency,
or duration of usage... providing users with the ability to identify the most popular, faddish
and time-consuming pages they’ve seen”.
“Have Seen, Have Not Seen”: this feature allows new information to be identified
and lets the user return to information already seen.
PART-A
PART-B
Explain in detail Collaborative filtering and Content-based recommendation system with an
example (Apr/May’17)
User/item ratings matrix (items A–Z rated by six users u1–u6; blanks are unrated):

Item   u1   u2   u3   u4   u5   u6
A       9         5         6   10
B       3         3         4    4
C            9         8         8
:       :    :    :    :    :    :
Z       5   10    7              1
Weight all users with respect to similarity with the active user.
Normalize ratings and compute a prediction from a weighted combination of the selected
neighbors’ ratings.
Present items with highest predicted ratings as recommendations.
Neighbor Selection
For a given active user, a, select correlated users to serve as source of predictions.
Standard approach is to use the most similar n users, u, based on similarity weights, wa,u
Alternate approach is to include all users whose similarity weight is above a given
threshold.
Rating Prediction
Predict a rating, pa,i, for each item i for the active user a, using the n selected neighbor
users u ∈ {1, 2, …, n}.
To account for users’ different rating levels, base predictions on differences from a user’s
average rating.
pa,i = r̄a + [ Σu=1..n wa,u · (ru,i − r̄u) ] / [ Σu=1..n |wa,u| ]
Similarity Weighting
• Typically use Pearson correlation coefficient between ratings for active user, a,
and another user, u.
ca,u = covar(ra, ru) / (σra · σru)
covar(ra, ru) = Σi=1..m (ra,i − r̄a) · (ru,i − r̄u) / m
where rx,i is user x’s rating for item i, r̄x is user x’s mean rating, and σrx is the
standard deviation of user x’s ratings:
r̄x = Σi=1..m rx,i / m
• Standard Deviation:
σrx = sqrt( Σi=1..m (rx,i − r̄x)² / m )
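A sketch combining the two formulas above (the neighbors and ratings are hypothetical, and the correlation is computed only over co-rated items):

```python
import math

def pearson(ra, ru):
    # w_{a,u}: Pearson correlation over items both users rated.
    common = set(ra) & set(ru)
    if len(common) < 2:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mu = sum(ru[i] for i in common) / len(common)
    cov = sum((ra[i] - ma) * (ru[i] - mu) for i in common)
    sa = math.sqrt(sum((ra[i] - ma) ** 2 for i in common))
    su = math.sqrt(sum((ru[i] - mu) ** 2 for i in common))
    return cov / (sa * su) if sa and su else 0.0

def predict(active, neighbors, item):
    # p_{a,i} = mean(a) + sum_u w_{a,u}(r_{u,i} - mean(u)) / sum_u |w_{a,u}|
    mean_a = sum(active.values()) / len(active)
    num = den = 0.0
    for ru in neighbors:
        if item not in ru:
            continue
        w = pearson(active, ru)
        num += w * (ru[item] - sum(ru.values()) / len(ru))
        den += abs(w)
    return mean_a + (num / den if den else 0.0)

active = {"A": 9, "B": 3, "Z": 5}
neighbors = [{"A": 10, "B": 4, "C": 8, "Z": 1}, {"A": 5, "B": 3, "Z": 7}]
print(round(predict(active, neighbors, "C"), 1))   # predicted rating for item C
```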
Significance Weighting
Cold Start: There needs to be enough other users already in the system to find a
match.
Sparsity: If there are many items to be recommended, even if there are many
users, the user/ratings matrix is sparse, and it is hard to find users that have rated
the same items.
First Rater: Cannot recommend an item that has not been previously rated.
– New items
– Esoteric items
Popularity Bias: Cannot recommend items to someone with unique tastes.
Content-Based Recommending
• Lots of systems
– No first-rater problem.
• Well-known technology: the entire field of classification learning is at (y)our
disposal!
E.g., Movie Domain
• Popular opinions:
– User comments, Newspaper and Newsgroup reviews, etc.
Content-Boosted Collaborative Filtering
The deep Web consists of Web sites that are hidden or unable to be found or catalogued
by regular search engines.
200,000+ Web sites
550 billion individual documents compared to the three billion of the surface Web
Contains 7,500 terabytes of information compared to nineteen terabytes in the surface
Web
Total quality content is 1,000 to 2,000 times greater than that of the surface Web.
The sixty largest deep-Web sites collectively contain over 750 terabytes of information;
they exceed the size of the surface Web forty times.
Fastest growing category of new information on the Internet
Fifty percent greater monthly traffic than surface sites
More highly linked to than surface sites
Narrower, with deeper content, than conventional surface sites
More than half of the content resides in topic-specific databases
Content is highly relevant to every information need, market, and domain.
Not well known to the Internet-searching public
Searching is usually carried out using a “directory” or a “search engine”.
Fast and efficient
Misses most of what is out there
70% of searchers start from three sites (Nielsen, 2003): Google, Yahoo, and MSN.
Searching Tools
Directories
Search engines
1. Searchable databases:
Typing is required.
Pages are not available until asked for (e.g., Library of Congress).
Pages are not static but dynamic (may not exist until requested).
Search engines can’t handle “dynamic pages.”
Search engines can’t handle “input boxes.”
3. Non-HTML pages:
PDF, Word, Shockwave, Flash...
PART-A
Sentence selection:
fd,w = 7 − 0.1 · (25 − sd),   if sd < 25
fd,w = 7,                     if 25 ≤ sd ≤ 40
fd,w = 7 + 0.1 · (sd − 40),   otherwise
w w w w w w w w w w      (initial sentence: all words w)
w w s w s s w w s w      (significant words marked s)
w w [s w s s w w s] w    (bracketed span covering the significant words)
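A small sketch of the piecewise weight above, plus the usual significance score for the bracketed span, i.e. (significant words)² / span length. Note that this span-scoring formula is Luhn-style and is an assumption here, since the notes only show the bracketing:

```python
def sentence_weight(sd):
    # Piecewise weight f_{d,w} from the formula above (sd as defined there).
    if sd < 25:
        return 7 - 0.1 * (25 - sd)
    if sd <= 40:
        return 7
    return 7 + 0.1 * (sd - 40)

# Bracketed span [s w s s w w s]: 4 significant words over 7 positions.
span = ["s", "w", "s", "s", "w", "w", "s"]
significance = span.count("s") ** 2 / len(span)      # Luhn-style score (assumed)
print(sentence_weight(30), round(significance, 2))   # 7, 2.29
```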
Key Terms Extraction
The key term extraction module has three sub-modules: query term extraction, title
words extraction and meta keywords extraction.
The query term extraction module receives the parsed and translated query, and extracts
all the query terms from it along with their Boolean relations (AND/NOT).
Sentence Extraction
This takes the parsed text of the documents as input, filters it, and extracts all the
sentences from the parsed text. It has two sub-modules: text filterization and sentence
extraction.
4.12.2 SUMMARIZATION
A summary is a text that is produced from one or more texts, conveys a significant
portion of the information in the original text(s), and is no longer than half of the original text.
4.12.3 QUESTION ANSWERING
PART-B
The main aim of QA is to present the user with a short answer to a question rather than a
list of possibly relevant documents.
As it becomes more and more difficult to find answers on the WWW using standard
search engines, question answering technology will become increasingly important.
PART-B
Explain in detail about the working of the Naïve Bayesian classifier with an example. (Nov/Dec’16)
Cross-lingual information retrieval is important for countries like India, where a very
large fraction of people are not conversant with English and thus do not have access to
the vast store of information on the web.
Document translation:
• Translate the entire document collection into English.
• Search the collection in English.
• Documents can be translated and stored offline, since automatic translation can be slow.
****************************************************************
UNIT-V
DOCUMENT TEXT MINING: Information filtering; organization and relevance feedback – Text
Mining – Text classification and clustering – Categorization algorithms: naive Bayes; decision
trees; and nearest neighbor – Clustering algorithms: agglomerative clustering; k-means;
expectation maximization (EM).
PART-A