CS8080 Information Retrieval Techniques
Question Bank
3 0 0 3
OBJECTIVES:
UNIT I INTRODUCTION 9
Information Retrieval – Early Developments – The IR Problem – The Users Task – Information versus
Data Retrieval – The IR System – The Software Architecture of the IR System – The Retrieval and
Ranking Processes – The Web – The e-Publishing Era – How the web changed Search – Practical
Issues on the Web – How People Search – Search Interfaces Today – Visualization in Search
Interfaces.
UNIT IV WEB RETRIEVAL AND WEB CRAWLING 9
The Web – Search Engine Architectures – Cluster based Architecture – Distributed Architectures –
Search Engine Ranking – Link based Ranking – Simple Ranking Functions – Learning to Rank –
Evaluations — Search Engine Ranking – Search Engine User Interaction – Browsing – Applications
of a Web Crawler – Taxonomy – Architecture and Implementation – Scheduling Algorithms –
Evaluation.
TOTAL: 45 PERIODS
OUTCOMES:
Use an open source search engine framework and explore its capabilities
Apply appropriate method of classification or clustering.
Design and implement innovative features in a search engine.
Design and implement a recommender system.
TEXT BOOKS:
REFERENCES:
UNIT I INTRODUCTION
PART - A
1. Give any two advantages of using artificial intelligence in information retrieval tasks.
(Apr/May 2018) U
The advantages of using artificial intelligence in information retrieval tasks are as follows:
Information characterization
Search formulation in information seeking
System Integration
Support functions
3. How does the large amount of information available on the web affect information retrieval
system implementation? (Apr/May 2018) U
The large amount of unstructured information on the web is difficult to deal with: obtaining specific
information is hard and time-consuming. An Information Retrieval (IR) system is a way to address
this problem, but it does not give a perfect solution. The scale of the web can still cause
Information overload
Time-consuming retrieval
particularly textual information. Information Retrieval is the activity of obtaining material (usually
documents) of an unstructured nature (usually text) that satisfies an information need from within
large collections (usually stored on computers). For example, information retrieval takes place when
a user enters a query into the system.
11. What is an open source search framework? List examples. (Nov/Dec 2020) R
A search engine is a software program that helps people find the information they are looking for
online using search queries containing keywords or phrases. Several popular open-source search
engines and frameworks can be used to build search functionality into a website:
Apache Lucene
Apache Solr
Elasticsearch
MeiliSearch
Typesense
13. What are the major impacts that Web has had in the development of IR? R
The major impacts are:
Web Document Collection and Search Engine Optimization
Size of the collection and the volume of user queries submitted on a daily basis
Predicting relevance is much harder than before due to the vast size of the document
collection.
Web is not just a repository of documents and data, but also a medium to do business.
Web advertising and other economic incentives.
14. What are the performance measures of search engine? (Nov/Dec 2021) R
The two fundamental metrics are recall, measuring the ability of a search engine to find the
relevant material in the index, and precision, measuring its ability to place that relevant
material high in the ranking.
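In set notation, these two measures are commonly written as follows (Relevant is the set of relevant documents in the collection and Retrieved the set of documents returned for the query):

```latex
\mathrm{Precision} = \frac{|\mathrm{Relevant} \cap \mathrm{Retrieved}|}{|\mathrm{Retrieved}|}
\qquad
\mathrm{Recall} = \frac{|\mathrm{Relevant} \cap \mathrm{Retrieved}|}{|\mathrm{Relevant}|}
```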
20. What are the major activities of the information seeking process model? R
The classic notion of the information seeking process model as described by Sutcliffe and
Ennis is formulated as a cycle consisting of four main activities:
Problem identification,
Articulation of information need(s),
Query formulation, and
Results evaluation.
Part B & C
4. Explain in detail about the features of IR. (Nov/Dec 2016 & Apr/May 2021) U
5. Write short notes on (Nov/Dec 2016)
i. Characterizing the web for search. U
ii. Role of AI in IR. AN
Part A
3. Can the tf-idf weight of a term in a document exceed 1? Why? (Apr/May 2018, Nov/Dec
2021) U
Yes, the tf-idf weight of a term in a document can exceed 1. TF-IDF is a family of measures for
scoring a term with respect to a document (relevance). The simplest form of tf(word, document) is
the raw number of times the word appears in the document, which can already be greater than 1;
adding the IDF effect multiplies it by log(number of documents / number of documents in which the
word is present), so the product easily exceeds 1 for a frequent term that occurs in few documents.
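A small worked example makes this concrete (a minimal sketch using the common tf × log10(N/df) weighting; the term counts and collection size below are invented for illustration):

```python
import math

def tf_idf(tf, df, n_docs):
    """Raw term frequency times inverse document frequency (base-10 log)."""
    return tf * math.log(n_docs / df, 10)

# Hypothetical collection of 1000 documents; the term occurs 3 times in this
# document and appears in only 10 documents overall.
weight = tf_idf(tf=3, df=10, n_docs=1000)
print(weight)  # 3 * log10(100) = 6.0, clearly greater than 1
```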
4. Consider the two texts,” Tom and Jerry are friends” and “Jack and Tom are friends”. What
is the cosine similarity for these two texts? (Apr/May 2018, Nov/Dec 2021) U
Using the vocabulary (Tom, and, Jerry, are, friends, Jack), the term-occurrence vectors are
a : 1,1,1,1,1,0
b : 1,1,0,1,1,1
sim(a,b) = (1*1 + 1*1 + 1*0 + 1*1 + 1*1 + 0*1) / (sqrt(5) * sqrt(5)) = 4/5 = 0.8
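The same result can be checked with a short script (a minimal sketch; tokenisation here is naive lower-cased whitespace splitting):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw term counts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    vocab = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in vocab)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("Tom and Jerry are friends",
                        "Jack and Tom are friends"))  # ≈ 0.8
```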
8. What do you mean by relevance feedback? (Apr/May 2021 & Apr/May 2022) R
Relevance feedback is a feature of some information retrieval systems. The idea behind relevance
feedback is to take the results that are initially returned from a given query, to gather user
feedback, and to use information about whether or not those results are relevant to perform a new
query.
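The standard way to turn such feedback into a modified vector space query is the Rocchio formulation (see the Rocchio algorithm question in Part B & C), where q0 is the original query vector, Dr and Dnr are the sets of judged relevant and non-relevant documents, and α, β, γ are tuning weights:

```latex
\vec{q}_m \;=\; \alpha\,\vec{q}_0
  \;+\; \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j
  \;-\; \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j
```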
14. Describe the differences between vector space relevance feedback and probabilistic relevance feedback.
(Nov/Dec 2020) AN
In vector space relevance feedback the tf-idf weighting is directly proportional to the frequency of
the query term in the document, whereas probabilistic relevance feedback only takes into account the
presence or absence of the term in the document. In general, the tf-idf weight of a term t is:
1. highest when t occurs many times within a small number of documents (thus lending high
discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many documents (thus
offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.
16. Compare Term Frequency and Inverse Document Frequency. (Apr/May 2021) AN
Term frequency refers to the number of times that a term t occurs in a document d. The inverse
document frequency is a measure of whether a term is common or rare across a given document
corpus. It is obtained by dividing the total number of documents by the number of documents
containing the term and taking the logarithm of that ratio.
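In symbols, with N the total number of documents in the corpus and df_t the number of documents containing term t, the usual definitions are:

```latex
\mathrm{idf}_t = \log\frac{N}{\mathrm{df}_t},
\qquad
\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d}\times\log\frac{N}{\mathrm{df}_t}
```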
17. List the two key statistics that are used to assess the effectiveness of an IR system. U
Precision
Recall
22. Is the vector space model always superior to the Boolean model? AN
No. Information retrieval using the Boolean model is usually faster than using the vector space
model. Boolean retrieval can be viewed as a special case of the vector space model, so if ranking
accuracy alone is considered the vector space model gives better or equivalent results, but it is not
superior in every respect (for example, in speed and simplicity).
23. What are the advantages and limitations of the vector space model? U
Advantages:
1. It is a simple model based on linear algebra.
2. Term weights are not binary.
3. It allows computing a continuous degree of similarity between queries and documents.
Disadvantages:
1. It suffers from synonymy and polysemy.
2. It theoretically assumes that terms are statistically independent.
Part B & C
10. Explain in detail about vector space model for documents. (Nov/Dec 2018 & Apr/May 2022)
U
11. Explain TF-IDF weighting. (Apr/May 2022) U
12. Describe processing with sparse vectors. U
13. Explain probabilistic IR. U
14. Describe the language model based IR. (Nov/Dec 2020) U
15. When does relevance feedback work? (Nov/Dec 2018, Nov/Dec 2020) AN
Part A
1. List the basic types of machine learning algorithms, depending on the learning mechanisms
used. R
Depending on the learning mechanism used, machine learning algorithms are basically of three
types:
Supervised learning
Unsupervised learning
Semi-supervised learning
Other types of learning algorithms include
Reinforcement learning and
Transduction
2. Differentiate between relevance feedback and pseudo relevance feedback. (Nov/Dec 2018)
U
The idea of relevance feedback is to involve the user in the retrieval process so as to improve the
final result set. In particular, the user gives feedback on the relevance of documents in an initial set
of results.
Pseudo relevance feedback, also known as blind relevance feedback, provides a method for
automatic local analysis. It automates the manual part of relevance feedback, so that the user gets
improved retrieval performance without an extended interaction.
For example, news stories are typically organized by subject categories (topics) or geographical
codes; academic papers are often classified by technical domains and sub-domains; patient reports
in health-care organizations are often indexed from multiple aspects, using taxonomies of disease
categories, types of surgical procedures, insurance reimbursement codes and so on. Another
widespread application of text categorization is spam filtering, where email messages are classified
into the two categories of spam and non-spam, respectively.
5. How do spammers use cloaking to serve spam to web users? (Apr/May 2018) U
Cloaking is a spamming technique in which the content presented to the search engine spider is
different from the content presented to regular users. This is done by delivering content based on
the user-agent HTTP header of the user requesting the page, the IP address of a user or the referring
page: A web page can be cloaked based on the IP address, the user-agent, referring web page or any
combination of these three factors.
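One crude way to observe (or check for) this behaviour is to fetch the same URL with a browser-like and a crawler-like User-Agent and compare the responses. A minimal sketch, assuming the third-party requests library is available; the URL and User-Agent strings are placeholders, and differing digests are only a hint, since dynamic pages also change between requests:

```python
import hashlib
import requests  # third-party HTTP library, assumed available

def page_digest(url, user_agent):
    """Fetch a URL with the given User-Agent and return a digest of the body."""
    resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    return hashlib.sha256(resp.content).hexdigest()

url = "http://example.com/some-page"          # placeholder URL
browser_ua = "Mozilla/5.0 (Windows NT 10.0)"  # looks like a regular browser
crawler_ua = "Googlebot/2.1"                  # looks like a search engine spider

# Matching digests suggest the same content is served to both; a mismatch is
# only a hint of cloaking, since dynamic pages also vary between requests.
print(page_digest(url, browser_ua) == page_digest(url, crawler_ua))
```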
7. Can a digest of the characters in a web page be used to detect near-duplicate web pages? Why?
(Apr/May 2018) U
The simplest method for detecting duplicates is to compute, for every web page, a fingerprint that is
a succinct (say 64-bit) digest of the characters on that page. When the fingerprints of two web pages
are equal, we then test whether the pages themselves are the same and, if so, declare one of them to
be a duplicate copy of the other. Such an exact digest, however, only catches exact duplicates; near
duplicates that differ in a few characters produce different digests, so shingling-style techniques are
needed for true near-duplicate detection. A minimal sketch of the exact-fingerprint step follows.
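A minimal sketch of the fingerprinting idea described above; the hash function and the 64-bit digest size are illustrative choices:

```python
import hashlib

def fingerprint(page_text):
    """64-bit digest of a page's characters (whitespace-normalised)."""
    normalised = " ".join(page_text.split())
    return hashlib.blake2b(normalised.encode("utf-8"), digest_size=8).hexdigest()

seen = {}  # fingerprint -> URL of the first page that produced it

def is_duplicate(url, page_text):
    """Return (True, earlier_url) if an identical fingerprint was seen before."""
    fp = fingerprint(page_text)
    if fp in seen:
        return True, seen[fp]   # candidate exact duplicate of an earlier page
    seen[fp] = url
    return False, None
```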
Part B & C
1. Explain in detail about naïve bayes algorithm and its application in text classification.
(Nov/Dec 2019, Nov/Dec 2020, Apr/May 2021 & Apr/May 2022) U
2. Discuss in detail about SVM classifier and their use in text classification. (Apr/May 2022) U
of your organization. Elaborate in detail about the steps you will take and various factors to be
considered while designing. (Apr/May 2021) C
6. Describe the distributing indexes. U
7. Explain the paid placement. U
8. Explain web indexes. U
9. Explain the basic XML concepts. U
10. Illustrate the various challenges in XML retrieval with appropriate examples. AN
11. Explain the Rocchio algorithm for relevance feedback. (Nov/Dec 2020 & Nov/Dec 2021) U
12. Explain in detail about vector space model for XML retrieval.
(Nov/Dec 2019, Apr/May 2019, Nov/Dec 2020 & Nov/Dec 2021) U
UNIT IV WEB RETRIEVAL AND WEB CRAWLING
The Web – Search Engine Architectures – Cluster based Architecture – Distributed Architectures –
Search Engine Ranking – Link based Ranking – Simple Ranking Functions – Learning to Rank –
Evaluations - Search Engine Ranking – Search Engine User Interaction – Browsing – Applications of
a Web Crawler – Taxonomy – Architecture and Implementation – Scheduling Algorithms –
Evaluation.
PART A
2. What are the performance measures of search engine? (Nov/Dec 2017, Nov/Dec 2018) U
Speed of response and the size of the index are factors, but a way of quantifying user happiness is
also needed. The key technical measures are precision and recall; on the web, click-based
(pay-per-click) measures are also used.
A search engine allows users to enter a search string, or query, and returns a page with clickable
references, or hits, to pages available on the Web.
6. What is the purpose of web crawler? (Nov/Dec 2016, Nov/Dec 2021 & Apr/May 2021) U
A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World
Wide Web, typically for the purpose of Web indexing. The purposes of a web crawler include:
Creating a localized search engine
Load testing from multiple server locations and different countries
Detecting SEO issues on pages, such as missing meta tags
Generating customized reports that log-file analysis tools might not create
Spell-checking pages when working on large sites
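The fetch-parse-enqueue loop at the heart of any crawler can be sketched in a few lines (a toy example using only the Python standard library; it omits politeness, robots.txt handling, and duplicate elimination, which a real crawler needs):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href attributes of anchor tags while parsing HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl starting from seed_url; returns the visited URLs."""
    frontier, visited = deque([seed_url]), set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except (OSError, ValueError):
            continue  # skip pages that cannot be fetched
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))  # resolve relative links
    return visited
```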
8. List the characteristics of Map Reduce Strategy. (Nov/Dec 2017 & Nov/Dec 2021) R
Very large-scale data: petabytes, exabytes
Write-once, read-many data
Map and reduce operations are typically performed by the same physical processor
The numbers of map tasks and reduce tasks are configurable
Operations are provisioned near the data.
Commodity hardware and storage.
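The canonical word-count example illustrates the map, shuffle, and reduce phases of the programming model (an in-process sketch of the idea, not of Hadoop itself; the two toy documents are invented):

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) pairs for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

documents = ["map and reduce", "map once reduce many"]  # toy input
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'map': 2, 'and': 1, 'reduce': 2, 'once': 1, 'many': 1}
```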
6. Quality
7. Freshness
8. Extensible
26. Compute the Jaccard similarity for the two lists of words (time, flies, like, an, arrow) and
(how, time, flies). (Apr/May 2018) U
A : time, flies, like, an, arrow
B : how, time, flies
Word    In A    In B
time     1       1
flies    1       1
like     1       0
an       1       0
arrow    1       0
how      0       1
JS(A, B) = |A ∩ B| / |A ∪ B| = 2/6 ≈ 0.33
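The same computation as a short function (a minimal sketch; duplicate words, if any, are collapsed by the set conversion):

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard(["time", "flies", "like", "an", "arrow"],
              ["how", "time", "flies"]))  # 2 / 6 = 0.333...
```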
30. Outline the differences between web search and information retrieval. (Apr/May 2022)
AN
Information retrieval (IR) is the broader field concerned with retrieving information from document
collections; a search engine is an application of IR techniques. A web search engine, in particular, is
a tool for finding information on the WWW and must continuously update its index to keep up with
the changing Web.
31. Define recall and precision in the context of a search engine. (Apr/May 2022) U
Recall is the number of relevant documents retrieved by a search divided by the total number
of existing relevant documents, while precision is the number of relevant documents retrieved
by a search divided by the total number of documents retrieved by that search.
32. How does relevance scoring work in web search? (Apr/May 2019) U
Search relevance is a measure of how accurately the search results match the search query. Online
users have high expectations: thanks to the high bar set by sites like Google, Amazon, and Netflix,
they expect accurate, relevant, and rapid results.
Part B & C
1. Consider a web graph with three nodes 1, 2 and 3. The links are as follows: 1 → 2, 3 → 2, 2 → 1,
2 → 3. Write down the transition probability matrix for the surfer's walk with teleporting, for the
teleport probability a = 0.5, and compute the PageRank. (Nov/Dec 2018 & Nov/Dec 2021) C
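A sketch of the computation asked for here, assuming the links are read as 1 → 2, 3 → 2, 2 → 1 and 2 → 3 and following the usual construction: row-normalise the adjacency matrix, mix in the teleport term a/N, then power-iterate to the stationary distribution:

```python
def pagerank(adj, alpha, iterations=100):
    """Power iteration on the teleporting transition matrix P.

    adj[i][j] = 1 if there is a link i -> j; alpha is the teleport probability.
    """
    n = len(adj)
    p = []
    for row in adj:
        out = sum(row)
        if out == 0:                        # dangling node: jump uniformly
            p.append([1.0 / n] * n)
        else:                               # follow a link with prob. 1 - alpha
            p.append([(1 - alpha) * x / out + alpha / n for x in row])
    pi = [1.0 / n] * n                      # start from the uniform distribution
    for _ in range(iterations):             # iterate pi <- pi * P
        pi = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
    return pi

# Links 1->2, 3->2, 2->1, 2->3 with teleport probability a = 0.5.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
print(pagerank(adj, alpha=0.5))  # approx. [0.278, 0.444, 0.278]
```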
2. Explain in detail about Community-based Question Answering system.
(Nov/Dec 2017, Nov/Dec 2018, Nov/Dec 2021) U
3. How do the various nodes of a distributed crawler communicate and share URLs?
(Nov/Dec 2018 & Nov/Dec 2021) U
4. Explain in detail about finger print algorithm for near-duplicate detection.
(Nov/Dec 2017, Nov/Dec 2018, Nov/Dec 2021 & Apr/May 2021) U
5. Brief about search engine optimization.
(Nov/Dec 2017, Nov/Dec 2020, Nov/Dec 2021 & Apr/May 2021) U
6. Elaborate on the search engine architectures.
(Nov/Dec 2016, Nov/Dec 2021, Apr/May 2021 & Apr/May 2022) U
7. Describe meta and focused crawling. (Nov/Dec 2016) U
8. Explain the features and architecture of web crawler.
(Nov/Dec 2018, Apr/May 2018, Apr/May 2019 & Nov/Dec 2021) U
9. Explain about online selection in web crawling. (Nov/Dec 2018 & Nov/Dec 2021) U
10. Discuss the design of a Question–Answer engine with the various phases involved. How can
the performance of such an engine be measured? (Apr/May 2018 & Apr/May 2019) R
11. Brief on Personalized search. (Nov/Dec 2017, Nov/Dec 2018 & Nov/Dec 2021) R
12. Explain in detail, the Collaborative Filtering using clustering technique.
(Nov/Dec 2017 & Nov/Dec 2021) R
13. Brief about HITS link analysis algorithm.
(Nov/Dec 2016, Nov/Dec 2017, Nov/Dec 2018, Nov/Dec 2019, Apr/May 2019, Nov/Dec
2020, Nov/Dec 2021 & Apr/May 2022) R
14. Explain in detail cross lingual information retrieval and its limitations in web search.
(Nov/Dec 2016) U
15. How does Map reduce work? Illustrate the usage of map reduce programming model in
Hadoop. (Apr/May 2018, Nov/Dec 2019, Apr/May 2019 & Nov/Dec 2020) U
16. Explain the page rank computation. AP
17. Describe Markov chain process. U
18. Explain web as a graph. U
19. Explain relevance scoring. U
20. How to handle invisible web? AN
21. Discuss about the snippet generation. (Apr/May 2019) U
22. Explain summarization. (Apr/May 2019) U
23. Discuss in detail about the typical user interaction models for the most popular Search
Engines of today. U
24. How is Crawling evaluation done? AN
Part A
17. Write the pros and cons of using classification in text mining over clustering algorithms.
(Apr/May 2019) AN
There is a fundamental difference between classification, which assigns items to predefined,
definitive categories, and clustering (grouping), which depends on the characteristics of a particular
sample or set of observations and is thus descriptive.
Advantages of classification:
The categories are predefined and interpretable, results can be evaluated directly against labelled
data, and trained classifiers such as Naive Bayes or SVMs can reach high accuracy on the categories
they were trained for.
Disadvantages of classification:
It requires labelled training data, which is expensive to obtain; many of the category schemes are
themselves based on subjective judgments, which may or may not be shared by everyone
participating; and, unlike clustering, it cannot discover new, unanticipated groups in the data.
18. Outline the difference between classification and clustering. (Nov/Dec 2019) AN
Classification
It is used with supervised learning.
It is a process where the input instances are classified based on their respective class labels.
It has labels hence there is a need to train and test the dataset to verify the model.
It is more complex in comparison to clustering.
Examples: Logistic regression, Naive Bayes classifier, Support vector machines.
Clustering
It is used with unsupervised learning.
It groups the instances based on how similar they are, without using class labels.
It is not needed to train and test the dataset.
It is less complex in comparison to classification.
Examples: k-means clustering algorithm, Gaussian (EM) clustering algorithm.
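The contrast is easy to see in code (a minimal sketch assuming scikit-learn is installed; the four toy documents and their spam/ham labels are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans

docs = ["cheap loans win prize", "meeting agenda attached",
        "win a free prize now", "project meeting on monday"]
labels = ["spam", "ham", "spam", "ham"]   # needed only for classification

X = TfidfVectorizer().fit_transform(docs)

# Classification: supervised, learns from the provided class labels.
clf = MultinomialNB().fit(X, labels)
print(clf.predict(X))    # predicts a class label for each document

# Clustering: unsupervised, groups documents without using the labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)        # cluster ids, not class labels
```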
Part B & C
1. Explain in detail the different classes of recommendation approaches along with their advantages
and disadvantages. (Apr/May 2021) U
2. With respect to recommender systems, explain in detail the data and knowledge sources. U
3. Explain in detail the high-level architecture of a content-based system. U
4. List the eleven popular tasks proposed by Herlocker that a recommender system can assist with. U
5. Why are product recommendation engines not good product search engines? Explain. U
6. Explain the role of Matrix factorization techniques in Collaborative filtering. (Apr/May 2021) U
7. Explain in detail about Neighborhood models of Collaborative filtering. (Apr/May 2022) U
8. Compare the accuracy of collaborative filtering algorithms using Netflix data. AN
9. Write short note on text mining. (Nov/Dec 2021) R
10. Explain in detail about agglomerative clustering. Compare it with other clustering algorithms.
(Nov/Dec 2021) U
11. Explain decision tree algorithm with example. (Nov/Dec 2019, Apr/May 2019 & Nov/Dec 2020) R
12. Explain K-means algorithm of clustering with example. (Nov/Dec 2019 & Nov/Dec 2020) R
13. Explain the application of Expectation-Maximization (EM) algorithm in text mining.
(Apr/May 2021) AN