1-Introduction-MIR
M. Soleymani
Spring 2024
Some slides have been adapted from Profs. Manning & Nayak's lectures (CS-276, Stanford)
Course info
• Instructor: Mahdieh Soleymani
• Email: [email protected]
Communication
• Quera
  • Policies and rules
  • Tentative schedule
  • Slides and notes
  • Projects
  • Discussion
• Email
  • Private questions
Textbook
Marking scheme
• Midterm: 20%
• Final Exam: 25%
• Quizzes: 10%
• Project (multiple phases): 45%
About homework assignments
Projects: Late policy
Collaboration policy
Typical IR system
• Given: a corpus (a collection of documents) and a user query
• Find: a ranked list of documents relevant to the query
• Diagram: the corpus and the query feed into the IR system, which returns a ranked list of documents.
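To make this contract concrete, here is a minimal sketch (my own illustration, not from the slides) that ranks a hypothetical toy corpus by naive term overlap with the query; real scoring models come later in the course.

```python
# Minimal sketch of the IR-system contract: (corpus, query) -> ranked docs.
# Scoring is a naive term-overlap count, purely illustrative; real models
# (Boolean, vector space, probabilistic) are covered in later lectures.

def rank(corpus: dict[str, str], query: str) -> list[tuple[str, int]]:
    query_terms = set(query.lower().split())
    scores = {
        doc_id: sum(1 for token in text.lower().split() if token in query_terms)
        for doc_id, text in corpus.items()
    }
    # Keep only matching docs, best score first.
    return sorted(((d, s) for d, s in scores.items() if s > 0),
                  key=lambda pair: pair[1], reverse=True)

corpus = {  # hypothetical toy corpus
    "d1": "information retrieval finds relevant documents",
    "d2": "databases store structured records",
    "d3": "ranked retrieval orders documents by relevance",
}
print(rank(corpus, "ranked retrieval of documents"))  # [('d3', 3), ('d1', 2)]
```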
Information Retrieval (IR)
Basic Definitions
• Document: the unit over which we decide to build the retrieval system
• Textual document: a sequence of words, punctuation, etc., that expresses ideas about some topic in a natural language
Heuristic nature of IR
Minimize search overhead
Condensing the data (indexing)
• Index the corpus to speed up the searching task
• Use the index instead of linearly scanning the docs, which is computationally expensive for large collections (see the sketch after this list)
• Indexing depends on the query language and the IR model
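A minimal sketch of the idea, on a hypothetical toy corpus (my illustration, not the course's code): build a term → posting-list index once, then answer a conjunctive keyword query by intersecting posting lists instead of scanning every document.

```python
from collections import defaultdict

def build_index(corpus: dict[int, str]) -> dict[str, list[int]]:
    """Build an inverted index: term -> sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index: dict[str, list[int]], query: str) -> list[int]:
    """Docs containing ALL query terms, found by intersecting posting
    lists rather than linearly scanning the whole collection."""
    postings = [set(index.get(t, ())) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

corpus = {1: "new home sales top forecasts",   # hypothetical toy corpus
          2: "home sales rise in july",
          3: "increase in home sales"}
index = build_index(corpus)
print(search(index, "sales rise"))  # [2]
```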
Typical IR system architecture
• Diagram: the user's need enters through the User Interface; Text Operations are applied both to the user's text and to the Corpus; Query Operations form the query, while Indexing builds the Index from the corpus; Searching retrieves docs from the Index; Ranking orders the retrieved docs, and user feedback flows back through the interface.
IR system components
• Text operations form index terms
  • Tokenization, stop-word removal, stemming, … (a minimal pipeline sketch follows)
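A minimal sketch of such a pipeline, assuming a toy stop-word list and a crude suffix-stripping "stemmer" (both hypothetical; a real system would use, e.g., the Porter stemmer):

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to"}  # toy stop-word list

def crude_stem(term: str) -> str:
    # Crude suffix stripping, only to illustrate the idea; a real system
    # would use something like the Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def index_terms(text: str) -> list[str]:
    tokens = text.lower().split()                        # tokenization (naive)
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [crude_stem(t) for t in tokens]               # stemming

print(index_terms("The ranking of retrieved documents"))
# ['rank', 'retriev', 'document']
```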
IR system components (continued)
Structured vs. unstructured docs
• Semi-structured text
  • e.g., web pages
Databases vs. IR: structured vs. unstructured data
Semi-structured data
Unstructured (text) vs. structured (database) data in the mid-nineties
[Bar chart: unstructured vs. structured data compared on data volume and market cap]
Unstructured (text) vs. structured (database) data today
[Bar chart: unstructured vs. structured data compared on data volume and market cap]
Data retrieval vs. information retrieval
• Data retrieval
  • Which items contain a given set of keywords, or satisfy a given (e.g., regular-expression-like) user query? (see the sketch after this list)
  • Well-defined structure and semantics
  • A single erroneous object implies failure!
• Information retrieval
  • Information about a subject
  • Semantics is frequently loose (natural language is not well structured and may be ambiguous)
  • Small errors are tolerated
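A small sketch of the contrast, on made-up data: exact pattern matching has all-or-nothing semantics, while ranked retrieval tolerates loose matches.

```python
import re

docs = {1: "error: disk failure on node 3",    # made-up mini-collection
        2: "the disk seems a bit slow today"}

# Data retrieval: well-defined semantics -- a doc either matches the
# pattern or it does not, and a single wrong answer means failure.
exact = [d for d, text in docs.items() if re.search(r"\bdisk failure\b", text)]

# Information retrieval: loose semantics -- rank docs by how well they seem
# to match the need "disk trouble"; imperfect answers are tolerated.
need = {"disk", "failure", "crash"}            # hypothetical need terms
ranked = sorted(docs, key=lambda d: -len(need & set(docs[d].split())))

print(exact)   # [1]
print(ranked)  # [1, 2] -- doc 1 matches more need terms; doc 2 still returned
```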
Evaluation of results (Sec. 1.1)
• Precision: fraction of retrieved docs that are relevant to the user's information need

  Precision = (relevant retrieved) / (total retrieved) = |Retrieved ∩ Relevant| / |Retrieved|

[Venn diagram: the Retrieved and Relevant sets, with precision measuring their overlap relative to Retrieved]
Example
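The original slide's example is not reproduced here; a hypothetical worked example instead: suppose the system retrieves 8 docs and 4 of them are relevant, so precision = 4/8 = 0.5.

```python
# Hypothetical doc IDs, made up for illustration.
retrieved = {1, 2, 3, 4, 5, 6, 7, 8}          # what the system returned
relevant  = {2, 3, 5, 7, 9, 11}               # ground-truth relevant docs

precision = len(retrieved & relevant) / len(retrieved)
print(precision)  # |{2, 3, 5, 7}| / 8 = 0.5
```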
Web Search
• Application of IR to (HTML) documents on the World
Wide Web.
• Web IR
  • Collect the doc corpus by crawling the web
  • Exploit the structural layout of docs
  • Beyond terms, exploit the link structure (ideas from social networks)
    • Link analysis, clickstreams, … (a preview sketch follows this list)
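Link analysis is treated later in the course; as a preview only, here is a minimal PageRank-style power iteration over a tiny made-up link graph (a sketch under simplifying assumptions, not the algorithm as presented in the course):

```python
# Minimal PageRank power iteration over a tiny, made-up link graph.
# graph[p] lists the pages that p links to.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

d = 0.85                                   # damping factor
pages = list(graph)
rank = {p: 1 / len(pages) for p in pages}  # start with a uniform distribution

for _ in range(50):
    new = {}
    for p in pages:
        # Sum the rank flowing into p from every page q that links to it.
        incoming = sum(rank[q] / len(graph[q]) for q in pages if p in graph[q])
        new[p] = (1 - d) / len(pages) + d * incoming
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
# "c" receives links from both "a" and "b", so it ends with the highest rank.
```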
Web IR
• Diagram: a crawler collects the corpus from the Web; given a query, the IR system returns a list of ranked pages.
The web and its challenges
Course main topics
• Indexing & text operations
• IR models
  • Boolean, vector space, probabilistic
• Evaluation of IR systems
• Web IR
  • Crawling
  • Duplication removal
  • Link-based algorithms
• Learning in IR
  • Classification & clustering
  • Learning to rank
  • (Distributed) word representation
  • NNs and deep embedding models
  • LLMs & RAG
• Some advanced topics