
Course overview and introduction

CE-324: Modern Information Retrieval


Sharif University of Technology

M. Soleymani
Spring 2024

Some slides have been adapted from Profs. Manning & Nayak's lectures (CS-276, Stanford).
Course info
• Instructor: Mahdieh Soleymani
• Email: [email protected]
• Office hours: by appointment via email
• Head TA: Mahdi Ghaznavi ([email protected])
Communication
• Quera
  • Policies and rules
  • Tentative schedule
  • Slides and notes
  • Projects
  • Discussion
• Email
  • Private questions
Textbook
• Introduction to Information Retrieval, C.D. Manning, P. Raghavan and H. Schuetze, Cambridge University Press, 2008.
  • Free online version is available at: http://informationretrieval.org/
• Papers
Marking scheme

• Midterm: 20%
• Final Exam: 25%
• Quizzes: 10%
• Project (multiple phases): 45%

About homework assignments

• 3-4 project assignments (practical)

• Projects are implementation-heavy

• Language of choice: Python

Projects: Late policy
• Everyone gets up to 6 total slack days
• You can distribute them across your projects, except for the last project, for which no slack days are allowed
• Once you use up your slack days, all subsequent late submissions accrue a 10% penalty per day (on top of any other penalties)
Collaboration policy
• We follow the CE Department Honor Code – read it carefully.
• Don't look at others' code; everything you submit should be your own work
• Don't share your solution or code with others, although discussing general ideas is fine and encouraged
• Indicate in your submissions anyone you worked with
Typical IR system
• Given: a corpus and a user query
• Find: a ranked set of docs relevant to the query
• Corpus: a collection of documents
[Diagram: the document corpus and the user's query enter the IR system, which returns a list of ranked documents]
Information Retrieval (IR)
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections [IIR Book].
• Retrieving documents relevant to a query (while retrieving as few non-relevant documents as possible)
  • especially from large sets of documents, efficiently
Information Retrieval (IR)
• These days we frequently think first of web search, but there are many other cases:
  • Corporate search engines
  • E-mail search
  • Searching your laptop
  • Legal information retrieval
Basic Definitions
• Document: the unit over which we decide to build a retrieval system
  • textual: a sequence of words, punctuation, etc. that expresses ideas about some topic in a natural language
• Corpus or collection: a set of documents
• Information need: the information required by the user about some topics
• Query: a formulation of the information need
Heuristic nature of IR
• Problem: the semantic gap between the query and the docs
  • A doc is relevant if the user perceives that it satisfies his/her information need
  • How to extract information from docs, and how to use it to decide relevance?
• Solution: the IR system must interpret and rank docs according to their relevance to the user's query.
  • "The notion of relevance is at the center of IR."
Minimize search overhead
• Search overhead: the time spent in all steps leading to the reading of items containing the needed information
  • Steps: query generation, query execution, scanning results, reading non-relevant items, etc.
• The amount of online data has grown at least as quickly as the speed of computers
Condensing the data (indexing)
• Indexing the corpus to speed up the searching task (a minimal sketch follows this list)
  • Using the index instead of linearly scanning the docs, which is computationally expensive for large collections
  • Indexing depends on the query language and the IR model
• Term (index unit): a word, phrase, or other group of symbols used for retrieval
  • Index terms are useful for remembering the document themes
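To make the idea concrete, here is a minimal sketch of an inverted index. It is not taken from the course materials; the toy corpus and the function name are illustrative only:

```python
from collections import defaultdict

def build_index(corpus):
    """Build a toy inverted index mapping each term to the set of
    IDs of the docs that contain it. `corpus` maps doc IDs to text."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():  # naive tokenization
            index[term].add(doc_id)
    return index

corpus = {1: "modern information retrieval",
          2: "information need and query",
          3: "ranked retrieval of documents"}
index = build_index(corpus)

# Answering "which docs contain both terms?" becomes a set
# intersection, with no linear scan over the documents themselves.
print(index["information"] & index["retrieval"])  # -> {1}
```

Real systems store sorted postings lists and richer term statistics, but the speed-up over a linear scan comes from exactly this lookup structure.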
Typical IR system architecture
[Diagram: the user interacts through the User Interface; the user's need passes through Text Operations and then Query Operations (refined via user feedback) to produce the query; Indexing applies Text Operations to the corpus to build the Index; Searching uses the Index to produce retrieved docs, which Ranking orders into ranked docs]
IR system components
• Text Operations form index terms (a small sketch follows this list)
  • Tokenization, stop-word removal, stemming, …
• Indexing constructs an index for a corpus of docs
• Query Operations transform the query to improve retrieval:
  • Query expansion using a thesaurus, or query transformation using relevance feedback
• Searching retrieves docs that are related to the query
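As a rough illustration of the text-operations step (not from the slides; the stop-word list is tiny and the suffix-stripping "stemmer" is a crude stand-in for a real one such as Porter's):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # tiny illustrative list

def text_operations(text):
    """Turn raw text into index terms: tokenize, remove stop words,
    and strip a few common suffixes as a toy form of stemming."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]      # stop-word removal
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]   # naive stemming

print(text_operations("The indexed documents and the rankings"))
# -> ['index', 'document', 'ranking']
```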
IR system components (continued)
• Ranking orders the retrieved documents according to their relevance.
• User Interface manages interaction with the user:
  • Query input and visualization of results
  • Relevance feedback
Structured vs. unstructured docs
• Unstructured text (free text): a continuous sequence of tokens
• Structured text (fielded text): text is broken into fields that are distinguished by tags or other markup
• Semi-structured text
  • e.g., web pages
Databases vs. IR:
Structured vs. unstructured data
• Structured data tends to refer to information in "tables"

  Student Name | Student ID | Supervisor Name | GPA
  Smith        | 20116671   | Joes            | 12
  Joes         | 20114190   | Chang           | 14.1
  Lee          | 20095900   | Chang           | 19

• Typically allows numerical range and exact match (for text) queries, e.g.,
  GPA < 16 AND Supervisor = Chang
Semi-structured data
• In fact almost no data is "unstructured"
  • E.g., this slide has distinctly identified zones such as the Title and Bullets
• Facilitates "semi-structured" search (a sketch follows this list) such as
  • Title contains data AND Bullets contain search
• … to say nothing of linguistic structure
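A minimal sketch of such a fielded query over toy docs (everything here, including the docs, field names, and helper function, is hypothetical rather than the course's API):

```python
# Hypothetical fielded docs; a real system would extract these zones from markup.
docs = [
    {"title": "data structures", "bullets": "search trees and hashing"},
    {"title": "web search",      "bullets": "crawling and ranking"},
    {"title": "big data",        "bullets": "search at scale"},
]

def field_query(docs, **field_terms):
    """Return docs where each named field contains the given term,
    e.g. Title contains 'data' AND Bullets contain 'search'."""
    return [d for d in docs
            if all(term in d[field].split() for field, term in field_terms.items())]

print(field_query(docs, title="data", bullets="search"))  # first and third docs
```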
Unstructured (text) vs. structured (database) data in the mid-nineties
[Bar chart comparing unstructured vs. structured data by data volume and by market cap in the mid-nineties]
Unstructured (text) vs. structured (database) data today
[Bar chart comparing unstructured vs. structured data by data volume and by market cap today]
Data retrieval vs. information retrieval
• Data retrieval
  • Which items contain a set of keywords, or satisfy the given (e.g., regular-expression-like) user query?
  • Well-defined structure and semantics
  • A single erroneous object implies failure!
• Information retrieval
  • Information about a subject
  • Semantics is frequently loose (natural language is not well structured and may be ambiguous)
  • Small errors are tolerated
Evaluation of results (IIR Sec. 1.1)
• Precision: fraction of retrieved docs that are relevant to the user's information need
  Precision = relevant retrieved / total retrieved = |Retrieved ∩ Relevant| / |Retrieved|
• Recall: fraction of relevant docs that are retrieved
  Recall = relevant retrieved / total relevant = |Retrieved ∩ Relevant| / |Relevant|
[Venn diagram: the Retrieved and Relevant sets overlap in the region "Retrieved & Relevant"]
Example
• Assume that there are 8 docs relevant to the query 𝑄.
• List of the retrieved docs for 𝑄:
  • d1: R
  • d2: NR
  • d3: R
  • d4: R
  • d5: NR
  • d6: NR
  • d7: NR
• 𝑃 = 3/7, 𝑅 = 3/8 (recomputed in the sketch below)
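To make the arithmetic explicit, a short sketch using the labels from the example above:

```python
# Relevance labels for the retrieved docs d1..d7, as listed above.
retrieved = ["R", "NR", "R", "R", "NR", "NR", "NR"]
total_relevant = 8  # relevant docs that exist for Q

relevant_retrieved = retrieved.count("R")
precision = relevant_retrieved / len(retrieved)  # 3/7
recall = relevant_retrieved / total_relevant     # 3/8

print(f"P = {precision:.3f}, R = {recall:.3f}")  # P = 0.429, R = 0.375
```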
Web Search
• Application of IR to (HTML) documents on the World Wide Web
• Web IR
  • Collect the doc corpus by crawling the web
  • Exploit the structural layout of docs
  • Beyond terms, exploit the link structure (ideas from social networks)
    • Link analysis, clickstreams, ...
Web IR
[Diagram: a crawler collects the corpus from the web; the query goes to the IR system, which returns a list of ranked pages]
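As a very rough illustration of the crawl step, here is a toy sketch using only the standard library; the names, limits, and omitted politeness rules are simplifications, not the course's crawler:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seed_url, max_pages=10):
    """Toy breadth-first crawler: fetch a page, extract its links, and
    enqueue unseen URLs. A real crawler also needs robots.txt handling,
    rate limiting, URL normalization, and duplicate-content detection."""
    frontier, seen, corpus = deque([seed_url]), {seed_url}, {}
    while frontier and len(corpus) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except (OSError, ValueError):  # unreachable or malformed URL
            continue
        corpus[url] = html  # store the page text for later indexing
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return corpus
```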
The web and its challenges
• Web collection properties
  • Distributed nature of the web collection
  • Size of the collection and volume of the user queries
  • Web advertisement (the web is a medium for business too)
  • Predicting relevance on the web
  • Docs change uncontrollably (dynamic and volatile data)
  • Unusual and diverse (heterogeneous) docs, users, and queries
Course main topics
• Indexing & text operations
• IR models
  • Boolean, vector space, probabilistic
• Evaluation of IR systems
• Web IR
  • Crawling
  • Duplication removal
  • Link-based algorithms
• Learning in IR:
  • Classification & clustering
  • Learning to rank
  • (Distributed) word representations
  • NNs and deep embedding models
  • LLMs & RAG
• Some advanced topics
