Syllabus
Preamble:
Information Extraction and Retrieval is a course that focuses on the techniques and methodologies
for extracting relevant information from large volumes of unstructured data and retrieving it
efficiently. The course explores various approaches, algorithms, and tools used to process and
analyze textual data, enabling students to gain insights and make informed decisions. Topics
covered include text mining, information retrieval models, document indexing, query processing,
and evaluation techniques. Through this course, students will develop the skills necessary to extract
valuable information from diverse sources and build effective retrieval systems to support
information needs.
Prerequisite: Basic knowledge of machine learning.
CO4 Describe text and multimedia languages. Implement efficient indexing techniques and
search algorithms. (Cognitive Knowledge Level: Apply)
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1
CO2
CO3
CO4
CO5
PO4 - Conduct investigations of complex problems
PO10 - Communication
Assessment Pattern
Bloom's Category    Test 1 (%)    Test 2 (%)    End Semester Examination (%)
Remember
Understand          30            30            30
Apply               70            70            70
Analyze
Evaluate
Create
Mark Distribution
Total Marks    CIE Marks    ESE Marks    ESE Duration
150            50           100          3 hours
There will be two parts: Part A and Part B. Part A contains 10 questions, with 2 questions from
each module, each carrying 3 marks. Students should answer all questions. Part B contains 2 full
questions from each module, of which the student should answer any one. Each question can have a
maximum of 2 sub-divisions and carries 14 marks.
Syllabus
Module – 1 (Introduction and Basic Concepts)
Introduction: Information versus Data Retrieval, IR: Past, present, and future. Basic concepts: The
retrieval process, logical view of documents. Modeling: A Taxonomy of IR models, ad-hoc
retrieval and filtering
Web search basics - Background and history, Web characteristics, Advertising as the economic
model, The search user experience, Index size and estimation, Near-duplicates and shingling
Web crawling and indexes – Crawling, Distributing indexes, Connectivity servers
Link analysis - The Web as a graph, PageRank
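PageRank, the last topic listed above, is usually introduced as a power iteration over the Web graph. A minimal sketch on a made-up three-page graph follows; the graph itself, the damping factor of 0.85, and the convergence tolerance are illustrative assumptions, not part of the syllabus.

```python
# Minimal PageRank via power iteration on a toy web graph.
# d = 0.85 is the damping factor commonly used in the literature.

def pagerank(links, d=0.85, tol=1e-10):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with a uniform distribution
    while True:
        new = {}
        for p in pages:
            # Sum contributions from every page q that links to p,
            # each divided by q's out-degree.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        if max(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new

# Toy graph: A and B link to each other; C links only to A.
graph = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(graph)
```

Since every page here has at least one outlink, the simple sketch above needs no special handling for dangling nodes, and the ranks remain a probability distribution (they sum to 1).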
Text Book
2. Let X_t be a random variable indicating whether the term t appears in a document. Suppose
we have |R| relevant documents in the document collection and that X_t = 1 in s of the
documents. Take the observed data to be just these observations of X_t for each document
in R. Show that the MLE for the parameter p_t = P(X_t = 1 | R = 1, q), that is, the value
of p_t which maximizes the probability of the observed data, is p_t = s/|R|.
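A sketch of the standard Bernoulli maximum-likelihood argument behind this exercise (each relevant document contributes an independent Bernoulli(p_t) observation):

```latex
% s successes (X_t = 1) in |R| independent Bernoulli(p_t) trials:
L(p_t) = p_t^{\,s}\,(1-p_t)^{\,|R|-s},
\qquad
\ell(p_t) = \log L(p_t) = s\log p_t + (|R|-s)\log(1-p_t)
% Setting the derivative of the log-likelihood to zero:
\ell'(p_t) = \frac{s}{p_t} - \frac{|R|-s}{1-p_t} = 0
\;\Longrightarrow\; s(1-p_t) = (|R|-s)\,p_t
\;\Longrightarrow\; p_t = \frac{s}{|R|}
```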
3. What is the relationship between the value of F1 and the break-even point?
1. Construct a Boolean query that retrieves documents containing the words "machine learning"
and "classification" but excludes any documents with the word "neural networks" present.
2. Explain the significance of reference collections in information retrieval research, and describe
the characteristics and importance of well-known collections like TREC and CACM.
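The kind of Boolean query asked for above can be sketched against a toy inverted index. The documents below are made-up examples, and phrase queries such as "machine learning" are approximated as conjunctions of their words, since this sketch has no positional index.

```python
# Toy inverted index supporting a Boolean AND / AND-NOT query.
docs = {
    1: "machine learning methods for text classification",
    2: "neural networks for machine learning classification",
    3: "classification with machine learning and decision trees",
    4: "machine learning overview",
}

# Build the inverted index: term -> set of doc IDs containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# ("machine learning" AND "classification") AND NOT "neural networks".
# Phrases are approximated by requiring all their words to co-occur.
result = (
    (index["machine"] & index["learning"] & index["classification"])
    - (index.get("neural", set()) & index.get("networks", set()))
)
```

Set intersection (`&`) implements AND, and set difference (`-`) implements AND NOT; here doc 2 is excluded because it contains both "neural" and "networks".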
QP CODE:
PART A
4. What is the relationship between the value of F1 and the break-even point?
Part B
(Answer any one question from each module. Each question carries 14 Marks)
OR
12. (a) Discuss the evolution of information retrieval over time. (7)
(b) What are the key differences between information retrieval and data (7)
retrieval? Provide examples to illustrate their distinctions.
13. (a) Compare and contrast the strengths and limitations of set-theoretic and (8)
probabilistic IR models, and discuss real-world scenarios where one
model may outperform the other.
(b) How can you find the similarity between a document and a query under the (6)
probabilistic principle, using Bayes' rule?
OR
14. (a) Explain vector-space retrieval models in detail with an example. (7)
15. (a) Construct a Boolean query that retrieves documents containing the words (6)
"machine learning" and "classification" but excludes any documents with
the word "neural networks" present.
OR
16. (a) Explain the significance of reference collections in information retrieval (14)
research, and describe the characteristics and importance of well-known
collections like TREC and CACM.
17. (a) How can clustering be classified using statistical techniques? Describe in detail. (7)
OR
18. (a) Describe text compression techniques. (6)
19. (a) What are the benefits of distributing Web search indexes? Explain the (7)
challenges and solutions for distributing indexes in a scalable and fault-
tolerant way.
OR
Teaching Plan
No    Contents    No. of Lecture Hours (35 hrs)
Module-1(Introduction) (4 hours)
1.1 Information versus Data Retrieval, IR: Past, present, and future. 1 hour
1.2 Basic concepts: The retrieval process, logical view of documents. 1 hour
2.7 Retrieval evaluation: Performance evaluation of IR - Recall and Precision, other measures  1 hour
Module-3 (Reference Collections and Query Languages) (5 hours)
3.1 Reference Collections such as TREC, CACM, and ISI data sets  2 hours
3.2 Query Languages: Keyword based queries, single word queries, context queries, Boolean Queries  2 hours
3.3 Query protocols 1 hour
Module-4 (Text and Multimedia Languages, Indexing, and Searching) (9 hours)
4.1 Text and Multimedia Languages and properties - Metadata, Text formats, Markup languages, Multimedia data formats  2 hours
4.2 Text Operations - Document preprocessing, Document Clustering  2 hours
4.3 Text Compression, Comparing text compression techniques  2 hours
4.4 Indexing and searching - Inverted files, other indices for text  1 hour
4.5 Sequential searching - Brute force, Knuth-Morris-Pratt  1 hour
4.6 Pattern matching - String matching allowing errors  1 hour
Module-5 (Fuzzy Applications) (7 hours)
5.1 Web search basics - Background and history, Web characteristics, Advertising as the economic model  1 hour
5.2 The search user experience, Index size and estimation, Near-duplicates and shingling  2 hours
5.3 Web crawling and indexes - Crawling, Distributing indexes, Connectivity servers  2 hours
5.4 Link analysis - The Web as a graph, PageRank  2 hours