
B TECH ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

ADT 453  INFORMATION EXTRACTION AND RETRIEVAL

Category   L   T   P   Credit
PEC        2   1   0   3

Preamble:
Information Extraction and Retrieval is a course that focuses on the techniques and methodologies
for extracting relevant information from large volumes of unstructured data and retrieving it
efficiently. The course explores various approaches, algorithms, and tools used to process and
analyze textual data, enabling students to gain insights and make informed decisions. Topics
covered include text mining, information retrieval models, document indexing, query processing,
and evaluation techniques. Through this course, students will develop the skills necessary to extract
valuable information from diverse sources and build effective retrieval systems to support
information needs.
Prerequisite: Basic knowledge in machine learning.

Course Outcomes

CO1 Understand information retrieval fundamentals. (Cognitive Knowledge Level: Understand)

CO2 Apply classic IR models and analyze IR model effectiveness. (Cognitive Knowledge Level: Apply)

CO3 Construct keyword-based queries and apply Boolean query approaches. (Cognitive Knowledge Level: Apply)

CO4 Describe text and multimedia languages. Implement efficient indexing techniques and
search algorithms. (Cognitive Knowledge Level: Apply)

CO5 Apply information extraction techniques and evaluate chunking and expansion. (Cognitive
Knowledge Level: Apply)

Mapping of course outcomes with program outcomes

PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12

CO1
CO2

CO3

CO4

CO5

Abstract POs defined by National Board of Accreditation

PO#   Broad PO                                     PO#    Broad PO

PO1   Engineering Knowledge                        PO7    Environment and Sustainability
PO2   Problem Analysis                             PO8    Ethics
PO3   Design/Development of solutions              PO9    Individual and team work
PO4   Conduct investigations of complex problems   PO10   Communication
PO5   Modern tool usage                            PO11   Project Management and Finance
PO6   The Engineer and Society                     PO12   Life long learning

Assessment Pattern

Bloom’s Category   Continuous Assessment Tests    End Semester Examination
                   Test 1 (%)     Test 2 (%)      Marks (%)

Remember
Understand         30             30              30
Apply              70             70              70
Analyze
Evaluate
Create

Mark Distribution

Total Marks   CIE Marks   ESE Marks   ESE Duration

150           50          100         3 hours

Continuous Internal Evaluation Pattern:


Attendance                                                       10 marks
Continuous Assessment Tests (Average of Internal Tests 1 & 2)    25 marks
Continuous Assessment Assignment                                 15 marks

Internal Examination Pattern


Each of the two internal examinations is to be conducted out of 50 marks. The first series test should
preferably be conducted after completing the first half of the syllabus, and the second series test after
completing the remaining part of the syllabus. There will be two parts: Part A and Part B. Part A
contains 5 questions (preferably 2 questions each from the completed modules and 1 question from the
partly completed module), each carrying 3 marks, adding up to 15 marks. Students should answer all
questions from Part A. Part B contains 7 questions (preferably 3 questions each from the completed
modules and 1 question from the partly completed module), each carrying 7 marks. Out of the 7
questions, a student should answer any 5.

End Semester Examination Pattern:

There will be two parts: Part A and Part B. Part A contains 10 questions, with 2 questions from each
module, each carrying 3 marks. Students should answer all questions. Part B contains 2 full questions
from each module, of which the student should answer any one. Each question can have a maximum of
2 subdivisions and carries 14 marks.

Syllabus
Module – 1 (Introduction and Basic Concepts)

Introduction: Information versus Data Retrieval, IR: Past, present, and future. Basic concepts: The
retrieval process, logical view of documents. Modeling: A Taxonomy of IR models, ad-hoc
retrieval and filtering.

Module – 2 (Classic IR Models and Retrieval Evaluation)


Classic IR models, Alternative Set theoretic models, Alternative algebraic models, Alternative
probabilistic models, Structured text retrieval models, models for browsing. Retrieval evaluation:
Performance evaluation of IR: Recall and Precision, other measures.

Module – 3 (Reference Collections and Query Languages)

Reference Collections such as TREC, CACM, and ISI data sets. Query Languages: Keyword based
queries, single word queries, context queries, Boolean Queries, Query protocols.

Module – 4 (Text and Multimedia Languages, Indexing, and Searching)

Text and Multimedia Languages and properties, Metadata, Text formats, Markup languages,
Multimedia data formats, Text Operations - Document preprocessing, Document
Clustering, Text Compression, Comparing text compression techniques. Indexing and
searching - Inverted files, other indices for text, Sequential searching - Brute force,
Knuth-Morris-Pratt, Pattern matching - String matching allowing errors.

Module – 5 (Web based Information Extraction)

Web search basics - Background and history, Web characteristics, Advertising as the economic
model, The search user experience, Index size and estimation, Near-duplicates and shingling.
Web crawling and indexes - Crawling, Distributing indexes, Connectivity servers.
Link analysis - The Web as a graph, PageRank.

Text Book

1. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: An Introduction to Information
Retrieval, Cambridge University Press.
2. R. Baeza-Yates and B. R. Neto: Modern Information Retrieval, Pearson Education, 2004.

Reference Books

1. C.J. van Rijsbergen: Information Retrieval, Butterworths.
2. Christopher D. Manning, Raghavan, and Schutze: Introduction to Information Retrieval, 2000.
3. David A. Grossman and Ophir Frieder: Information Retrieval: Algorithms and Heuristics (The
Information Retrieval Series, 2nd Edition).

Course Level Assessment Questions


Course Outcome 1 (CO1):
1. What are the key differences between information retrieval and data retrieval? Provide
examples to illustrate their distinctions.
2. Discuss the evolution of information retrieval over time.
Course Outcome 2 (CO2):
1. Compare and contrast the strengths and limitations of set-theoretic and probabilistic IR
models, and discuss real-world scenarios where one model may outperform the other.

2. Let $X_t$ be a random variable indicating whether the term $t$ appears in a document. Suppose
we have $|R|$ relevant documents in the document collection and that $X_t = 1$ in $s$ of the
documents. Take the observed data to be just these observations of $X_t$ for each document
in $R$. Show that the MLE for the parameter $p_t = P(X_t = 1 \mid R = 1, \vec{q})$, that is, the value
for $p_t$ which maximizes the probability of the observed data, is $p_t = s/|R|$.
3. What is the relationship between the value of F1 and the break-even point?
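
For question 2 above, the argument can be sketched as follows. This is only an outline, and it assumes (the question does not state this) that the observations of X_t over the |R| relevant documents are independent Bernoulli trials sharing the same parameter p_t. The symbols L and \ell for the likelihood and log-likelihood are introduced here for illustration.

```latex
\begin{align*}
L(p_t) &= p_t^{\,s}\,(1 - p_t)^{|R| - s} \\
\ell(p_t) &= \log L(p_t) = s \log p_t + (|R| - s)\log(1 - p_t) \\
\frac{d\ell}{dp_t} &= \frac{s}{p_t} - \frac{|R| - s}{1 - p_t} = 0 \\
s\,(1 - p_t) &= (|R| - s)\,p_t \quad\Longrightarrow\quad p_t = \frac{s}{|R|}
\end{align*}
```

For 0 < s < |R| the second derivative of the log-likelihood is negative, so this stationary point is a maximum.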

Course Outcome 3 (CO3):

1. Construct a Boolean query that retrieves documents containing the terms "machine learning"
and "classification" but excludes any document containing the term "neural networks".
2. Explain the significance of reference collections in information retrieval research, and describe
the characteristics and importance of well-known collections like TREC and CACM.
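
The kind of Boolean query asked for in question 1 above can be illustrated with a small sketch. The four-document collection, the helper `matching`, and the choice to treat each quoted term as a plain substring match are all assumptions made for this illustration; a real system would evaluate the query against an inverted index.

```python
# Toy collection: document id -> text (hypothetical data for illustration only).
docs = {
    1: "machine learning methods for classification of text",
    2: "classification with neural networks and machine learning",
    3: "machine learning for ranking",
    4: "classification of documents using machine learning",
}


def matching(term):
    """Return the set of ids of documents whose text contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.lower()}


def boolean_query():
    # "machine learning" AND "classification" AND NOT "neural networks"
    # AND is set intersection; AND NOT is set difference.
    both = matching("machine learning") & matching("classification")
    return both - matching("neural networks")


result = sorted(boolean_query())
print(result)
```

Representing each term's postings as a set lets AND and AND NOT map directly onto set intersection and set difference, which is the usual way Boolean retrieval is explained.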

Course Outcome 4 (CO4):

1. Describe index compression techniques.
2. How can clustering be classified using statistical techniques? Describe in detail.

Course Outcome 5 (CO5):

1. Define web search and web search engine.
2. Explain crawling and the types of crawling.

Model Question Paper

QP CODE:

Reg No: _______________

Name: _________________ PAGES : 4


APJ ABDUL KALAM TECHNOLOGICAL UNIVERSITY

EIGHTH SEMESTER B.TECH DEGREE EXAMINATION, MONTH & YEAR

Course Code: ADT 453

Course Name: INFORMATION EXTRACTION AND RETRIEVAL

Max. Marks: 100 Duration: 3 Hours

PART A

Answer All Questions. Each Question Carries 3 Marks

1. Give the historical view of Information Retrieval.

2. What are the components of IR?

3. Why might the classic IR models lead to poor retrieval?

4. What is the relationship between the value of F1 and the break-even point?

5. Explain the concept of a Boolean query. Discuss the advantages and limitations of using
sequential searching for information retrieval.

6. List out the query protocols.

7. Write notes on parallel inverted files.

8. What are the desirable properties of a clustering algorithm?

9. What are the basic rules for Web crawler operation?

10. Define web search and web search engine.


(10x3=30)

Part B
(Answer any one question from each module. Each question carries 14 Marks)

11. (a) Explain Information Retrieval in detail. (7)

(b) Explain the influence of AI in information retrieval. (7)

OR

12. (a) Discuss the evolution of information retrieval over time. (7)

(b) What are the key differences between information retrieval and data retrieval? Provide
examples to illustrate their distinctions. (7)

13. (a) Compare and contrast the strengths and limitations of set-theoretic and probabilistic IR
models, and discuss real-world scenarios where one model may outperform the other. (8)

(b) How can the similarity between a document and a query be computed under the probabilistic
principle using Bayes’ rule? (6)

OR

14. (a) Explain vector-space retrieval models in detail, with an example. (7)

(b) Write the formal characterization of IR models. (7)

15. (a) Construct a Boolean query that retrieves documents containing the terms "machine learning"
and "classification" but excludes any document containing the term "neural networks". (6)

(b) Explain keyword-based queries in detail. (8)

OR

16. (a) Explain the significance of reference collections in information retrieval research, and
describe the characteristics and importance of well-known collections like TREC and CACM. (14)

17. (a) How can clustering be classified using statistical techniques? Describe in detail. (7)

(b) Discuss the brute force algorithm. (7)

OR

18. (a) Describe text compression techniques. (6)

(b) Explain the Knuth-Morris-Pratt algorithm. (8)

19. (a) What are the benefits of distributing Web search indexes? Explain the challenges and
solutions for distributing indexes in a scalable and fault-tolerant way. (7)

(b) Explain crawling and the types of crawling. (7)

OR

20. (a) Briefly explain web search architectures. (9)

(b) Explain PageRank. (5)

Teaching Plan

No    Contents    No. of Lecture Hours (35 hrs)

Module-1 (Introduction) (4 hours)

1.1 Information versus Data Retrieval, IR: Past, present, and future. 1 hour

1.2 Basic concepts: The retrieval process, logical view of documents. 1 hour

1.3 Modeling: A Taxonomy of IR models 1 hour

1.4 Ad-hoc retrieval and filtering 1 hour

Module-2 (IR Models and Retrieval Evaluation) (10 hours)

2.1 Classic IR models 2 hours

2.2 Alternative set theoretic models 1 hour

2.3 Alternative algebraic models 2 hours

2.4 Alternative probabilistic models 2 hours

2.5 Structured text retrieval models 1 hour

2.6 Models for browsing 1 hour

2.7 Retrieval evaluation: Performance evaluation of IR: Recall and Precision, other measures 1 hour

Module-3 (Reference Collections and Query Languages) (5 hours)

3.1 Reference Collections such as TREC, CACM, and ISI data sets. 2 hours

3.2 Query Languages: Keyword based queries, single word queries, context queries, Boolean Queries 2 hours

3.3 Query protocols 1 hour

Module-4 (Text and Multimedia Languages, Indexing, and Searching) (9 hours)

4.1 Text and Multimedia Languages and properties - Metadata, Text formats, Markup languages, Multimedia data formats 2 hours

4.2 Text Operations - Document preprocessing, Document Clustering 2 hours

4.3 Text Compression, Comparing text compression techniques 2 hours

4.4 Indexing and searching - Inverted files, other indices for text 1 hour

4.5 Sequential searching - Brute force, Knuth-Morris-Pratt 1 hour

4.6 Pattern matching - String matching allowing errors 1 hour

Module-5 (Web based Information Extraction) (7 hours)

5.1 Web search basics - Background and history, Web characteristics, Advertising as the economic model 1 hour

5.2 The search user experience, Index size and estimation, Near-duplicates and shingling 2 hours

5.3 Web crawling and indexes - Crawling, Distributing indexes, Connectivity servers 2 hours

5.4 Link analysis - The Web as a graph, PageRank 2 hours
