
Course overview and introduction

CE-324: Modern Information Retrieval


Sharif University of Technology

M. Soleymani
Spring 2024

Some slides have been adapted from Profs. Manning & Nayak's lectures (CS-276, Stanford).
Course info
• Instructor: Mahdieh Soleymani
• Email: [email protected]
• Office hours: by appointment via email
• Head TA: Mahdi Ghaznavi ([email protected])
Communication
• Quera
  • Policies and rules
  • Tentative schedule
  • Slides and notes
  • Projects
  • Discussion
• Email
  • Private questions
Textbook
• Introduction to Information Retrieval, C.D. Manning, P. Raghavan and H. Schuetze, Cambridge University Press, 2008.
  • Free online version is available at: http://informationretrieval.org/
• Papers
Marking scheme

• Midterm: 20%
• Final Exam: 25%
• Quizzes: 10%
• Project (multiple phases): 45%

About homework assignments

• 3-4 project assignments (practical)

• Projects are implementation-heavy

• Language of choice: Python

Projects: Late policy
• Everyone gets up to 6 total slack days
• You can distribute them across your projects, except for the last project, for which no slack days are allowed
• Once you use up your slack days, all subsequent late submissions accrue a 10% penalty per day (on top of any other penalties)
Collaboration policy
• We follow the CE Department Honor Code – read it carefully.
• Don't look at others' code; everything you submit should be your own work
• Don't share your solution or code with others, although discussing general ideas is fine and encouraged
• Indicate in your submissions anyone you worked with
Typical IR system
• Given: a corpus and a user query
• Find: a ranked set of docs relevant to the query
• Corpus: a collection of documents
[Diagram: the document corpus and the user's query enter the IR system, which returns a list of ranked documents]
Information Retrieval (IR)
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections [IIR Book].
• Retrieving documents relevant to a query (while retrieving as few non-relevant documents as possible)
  • especially from large sets of documents, efficiently
Information Retrieval (IR)
• These days we frequently think first of web search, but there are many other cases:
  • Corporate search engines
  • E-mail search
  • Searching your laptop
  • Legal information retrieval
Basic Definitions
• Document: the unit over which we decide to build a retrieval system
  • textual: a sequence of words, punctuation, etc. that expresses ideas about some topic in a natural language
• Corpus or collection: a set of documents
• Information need: the information required by the user about some topics
• Query: a formulation of the information need
Heuristic nature of IR
• Problem: the semantic gap between the query and the docs
  • A doc is relevant if the user perceives that it satisfies his/her information need
  • How to extract information from docs, and how to use it to decide relevance?
• Solution: the IR system must interpret and rank docs according to their relevance to the user's query.
  • "The notion of relevance is at the center of IR."
Minimize search overhead
• Search overhead: the time spent in all steps leading to the reading of items containing the needed information
  • Steps: query generation, query execution, scanning results, reading non-relevant items, etc.
• The amount of online data has grown at least as quickly as the speed of computers
Condensing the data (indexing)
• Indexing the corpus to speed up the searching task (a minimal sketch follows this list)
  • Using the index instead of linearly scanning the docs, which is computationally expensive for large collections
  • Indexing depends on the query language and the IR model
• Term (index unit): a word, phrase, or other group of symbols used for retrieval
  • Index terms are useful for remembering the document themes
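To make the idea concrete, here is a minimal sketch of an inverted index. It is not taken from the course materials; the toy corpus and the function name are illustrative only:

```python
from collections import defaultdict

def build_index(corpus):
    """Build a toy inverted index mapping each term to the set of
    IDs of the docs that contain it. `corpus` maps doc IDs to text."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():  # naive tokenization
            index[term].add(doc_id)
    return index

corpus = {1: "modern information retrieval",
          2: "information need and query",
          3: "ranked retrieval of documents"}
index = build_index(corpus)

# Answering "which docs contain both terms?" becomes a set
# intersection, with no linear scan over the documents themselves.
print(index["information"] & index["retrieval"])  # -> {1}
```

Real systems store sorted postings lists and richer term statistics, but the speed-up over a linear scan comes from exactly this lookup structure.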
Typical IR system architecture
[Diagram: the user interacts through the User Interface; the user's need passes through Text Operations and then Query Operations (refined via user feedback) to produce the query; Indexing applies Text Operations to the corpus to build the Index; Searching uses the Index to produce retrieved docs, which Ranking orders into ranked docs]
IR system components
• Text Operations form index terms (a small sketch follows this list)
  • Tokenization, stop-word removal, stemming, …
• Indexing constructs an index for a corpus of docs
• Query Operations transform the query to improve retrieval:
  • Query expansion using a thesaurus, or query transformation using relevance feedback
• Searching retrieves docs that are related to the query
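As a rough illustration of the text-operations step (not from the slides; the stop-word list is tiny and the suffix-stripping "stemmer" is a crude stand-in for a real one such as Porter's):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # tiny illustrative list

def text_operations(text):
    """Turn raw text into index terms: tokenize, remove stop words,
    and strip a few common suffixes as a toy form of stemming."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]      # stop-word removal
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]   # naive stemming

print(text_operations("The indexed documents and the rankings"))
# -> ['index', 'document', 'ranking']
```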
IR system components (continued)
• Ranking orders the retrieved documents according to their relevance.
• User Interface manages interaction with the user:
  • Query input and visualization of results
  • Relevance feedback
Structured vs. unstructured docs
• Unstructured text (free text): a continuous sequence of tokens
• Structured text (fielded text): text is broken into fields that are distinguished by tags or other markup
• Semi-structured text
  • e.g., web pages
Databases vs. IR:
Structured vs. unstructured data
• Structured data tends to refer to information in "tables"

  Student Name | Student ID | Supervisor Name | GPA
  Smith        | 20116671   | Joes            | 12
  Joes         | 20114190   | Chang           | 14.1
  Lee          | 20095900   | Chang           | 19

• Typically allows numerical range and exact match (for text) queries, e.g.,
  GPA < 16 AND Supervisor = Chang
Semi-structured data
• In fact almost no data is "unstructured"
  • E.g., this slide has distinctly identified zones such as the Title and Bullets
• Facilitates "semi-structured" search (a sketch follows this list) such as
  • Title contains data AND Bullets contain search
• … to say nothing of linguistic structure
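A minimal sketch of such a fielded query over toy docs (everything here, including the docs, field names, and helper function, is hypothetical rather than the course's API):

```python
# Hypothetical fielded docs; a real system would extract these zones from markup.
docs = [
    {"title": "data structures", "bullets": "search trees and hashing"},
    {"title": "web search",      "bullets": "crawling and ranking"},
    {"title": "big data",        "bullets": "search at scale"},
]

def field_query(docs, **field_terms):
    """Return docs where each named field contains the given term,
    e.g. Title contains 'data' AND Bullets contain 'search'."""
    return [d for d in docs
            if all(term in d[field].split() for field, term in field_terms.items())]

print(field_query(docs, title="data", bullets="search"))  # first and third docs
```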
Unstructured (text) vs. structured (database) data in the mid-nineties
[Bar chart comparing unstructured vs. structured data by data volume and by market cap in the mid-nineties]
Unstructured (text) vs. structured (database) data today
[Bar chart comparing unstructured vs. structured data by data volume and by market cap today]
Data retrieval vs. information retrieval
• Data retrieval
  • Which items contain a set of keywords, or satisfy the given (e.g., regular-expression-like) user query?
  • Well-defined structure and semantics
  • A single erroneous object implies failure!
• Information retrieval
  • Information about a subject
  • Semantics is frequently loose (natural language is not well structured and may be ambiguous)
  • Small errors are tolerated
Evaluation of results (IIR Sec. 1.1)
• Precision: fraction of retrieved docs that are relevant to the user's information need
  Precision = relevant retrieved / total retrieved = |Retrieved ∩ Relevant| / |Retrieved|
• Recall: fraction of relevant docs that are retrieved
  Recall = relevant retrieved / total relevant = |Retrieved ∩ Relevant| / |Relevant|
[Venn diagram: the Retrieved and Relevant sets overlap in the region "Retrieved & Relevant"]
Example
• Assume that there are 8 docs relevant to the query 𝑄.
• List of the retrieved docs for 𝑄:
  • d1: R
  • d2: NR
  • d3: R
  • d4: R
  • d5: NR
  • d6: NR
  • d7: NR
• 𝑃 = 3/7, 𝑅 = 3/8 (recomputed in the sketch below)
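To make the arithmetic explicit, a short sketch using the labels from the example above:

```python
# Relevance labels for the retrieved docs d1..d7, as listed above.
retrieved = ["R", "NR", "R", "R", "NR", "NR", "NR"]
total_relevant = 8  # relevant docs that exist for Q

relevant_retrieved = retrieved.count("R")
precision = relevant_retrieved / len(retrieved)  # 3/7
recall = relevant_retrieved / total_relevant     # 3/8

print(f"P = {precision:.3f}, R = {recall:.3f}")  # P = 0.429, R = 0.375
```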
Web Search
• Application of IR to (HTML) documents on the World Wide Web
• Web IR
  • Collect the doc corpus by crawling the web
  • Exploit the structural layout of docs
  • Beyond terms, exploit the link structure (ideas from social networks)
    • Link analysis, clickstreams, ...
Web IR
[Diagram: a crawler collects the corpus from the web; the query goes to the IR system, which returns a list of ranked pages]
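As a very rough illustration of the crawl step, here is a toy sketch using only the standard library; the names, limits, and omitted politeness rules are simplifications, not the course's crawler:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seed_url, max_pages=10):
    """Toy breadth-first crawler: fetch a page, extract its links, and
    enqueue unseen URLs. A real crawler also needs robots.txt handling,
    rate limiting, URL normalization, and duplicate-content detection."""
    frontier, seen, corpus = deque([seed_url]), {seed_url}, {}
    while frontier and len(corpus) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except (OSError, ValueError):  # unreachable or malformed URL
            continue
        corpus[url] = html  # store the page text for later indexing
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return corpus
```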
The web and its challenges
• Web collection properties
  • Distributed nature of the web collection
  • Size of the collection and volume of the user queries
  • Web advertisement (the web is a medium for business too)
  • Predicting relevance on the web
  • Docs change uncontrollably (dynamic and volatile data)
  • Unusual and diverse (heterogeneous) docs, users, and queries
Course main topics
• Indexing & text operations
• IR models
  • Boolean, vector space, probabilistic
• Evaluation of IR systems
• Web IR
  • Crawling
  • Duplication removal
  • Link-based algorithms
• Learning in IR:
  • Classification & clustering
  • Learning to rank
  • (Distributed) word representations
  • NNs and deep embedding models
  • LLMs & RAG
• Some advanced topics
