Lecture 1. Introduction
Information Retrieval and Analysis
Vasily Sidorov
Today’s Outline
• Course information
—Why learn IR?
• Walkthrough of the components in a modern IR
system
• Search engine evolution
• Overview of Google search engine (1998)
—Architecture
—PageRank
Course Instructor (me)
Textbook
• Introduction to Information
Retrieval
—Christopher D. Manning
—Prabhakar Raghavan
—Hinrich Schütze
• Available online at:
—http://nlp.stanford.edu/IR-book/
Reference Books
• Modern Information Retrieval: The Concepts and Technology
behind Search (2nd Edition)
• Information Retrieval: Implementing and Evaluating Search
Engines
Pre-requisites
• Mathematics
—Linear algebra: vectors, matrices, matrix inverse
—Probability theory basics
◦ P(A ∩ B) = P(A)·P(B) for independent events A and B
• Hands-on Attitude
—Embrace new technology and ideas
• Programming
—Any programming language will do
◦ Python, Java, C#, C/C++, JS …
—Work with files
—Text encodings (ASCII, UTF-8, …)
—Algorithms
—Data structures
After taking this course, you’ll know
• How to build your own search engine, or customize
an existing text search engine
• How to enhance applications using IR, e.g.,
—Cluster text-like information such as microarray data
—Find similar actions / data / objects
—Parse/analyze text/dialogues (e.g., Facebook posts,
Twitter, comments)
• How to build your own NextGen IR killer-app
—e.g., matching people based on their preferences
—limited only by your imagination!
This course does NOT cover
• Non-text data Retrieval
—Image
—Video
• XML Retrieval and NoSQL databases
• Natural Language Processing
—Ontologies, e.g., WordNet, HowNet
—Part-of-Speech Tagging, Grammar, Parsing, …
—GPT-3 and other generative text models
• Structured Data Retrieval
—SQL
Let’s start
• What is IR?
—IR = Integrated Resort? Internet Relay?
—IR = Information Retrieval
• What to retrieve?
—bookmarks like del.icio.us
—people, like LinkedIn, Facebook
—books (in library or on Amazon)
—text (web pages, medical reports, assignment
reports)
—images (photos, Flickr)
—videos (movies, YouTube)
• IR vs. Text Mining
What is Text Mining?
• “The objective of Text Mining is to exploit information
contained in textual documents in various ways,
including …discovery of patterns and trends in data,
associations among entities, predictive rules, etc.”
— Grobelnik et al., 2001
Text Mining Challenges
• Data collection is “free text”
—Not well-organized; Semi-structured
or unstructured
• Natural language text contains
ambiguities on many levels
—Lexical, syntactic, semantic, and
pragmatic, e.g.,
Time flies like an arrow.
Fruit flies like a banana.
• Learning techniques for processing text typically
need annotated training examples
Text Mining Research Areas
• Information Retrieval (IR)
—Search Engines
—Classification
—Recommendation
• Information Understanding
—Natural Language Processing (NLP)
—Question Answering
—Concept Extraction from Newsgroups
—Visualization, Summarization
• Trend Detection
—Outlier Detection
—Event Detection
Why Learn IR?
• Understand limitations of state-of-the-art IR
—Learn what is possible in IR; tell fact from fiction
—Learn how to fool IR systems?
• Organize your personal information
—Master/create IR software to manage personal
information
• How to use Search Engines better
• Design next generation IR system!
—Be the next Google (not necessarily a search engine)
—Yahoo, Google, [Your Company]
How to Retrieve Information
• Example
—Scan through every book on a library/bookstore shelf
—View every image/video
• To speed up IR:
—Scan every piece of information once, before retrieval time
◦ Google/Bing/etc. try to download the entire Web
—Indexing = scan everything = remember where each piece of
information is located (see the sketch below)
◦ “1984” is located at Level 2, Shelf 34 of the National Library
◦ the list of documents containing “1984” is stored on disk C:\
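A minimal sketch of the idea in Python: an inverted index over a
hypothetical three-document collection, in the spirit of the
“list of documents containing 1984” example above.

# Minimal inverted-index sketch: map each term to the set of
# documents containing it, so lookup avoids rescanning everything.
from collections import defaultdict

docs = {  # hypothetical toy collection
    1: "1984 is a novel by George Orwell",
    2: "the national library stores many novels",
    3: "1984 is shelved at the national library",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["1984"]))     # [1, 3]
print(sorted(index["library"]))  # [2, 3]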
History
• 300 BC, Great Library of Alexandria, Egypt
—Most books were stored in armaria (closed, labelled
cupboards). Armaria were used for book storage till
medieval times.
Libraries Before Computers
• Cataloging
—A process of describing a document (both physical
attributes & subject contents)
—Catalog = key to a collection
• Bibliographic record
—A surrogate record produced by catalogers according
to defined standards (e.g., Machine Readable
Cataloging record)
• Subject Classification
—Allocating a classification number
Classical Indexing
• Indexing
—Human librarians construct document surrogates by
assigning identifiers to text items.
• Includes
—Keyword Indexing
◦ Similar to today's Search Engine Index
—Subject Indexing
◦ Similar to today's Classification Engine
Subject Indexing - Classification
• Hierarchical structure
—Similar subjects at the same level
◦ e.g., Furniture → Chairs, Tables
• Goals of Classification
—Collocate subjects
◦ group all documents on the same subject together on the
shelves & put them next to related subjects
—Define & assign a code (Call Number) to each document
◦ to facilitate identification in the catalogue and location
on the shelf
Dewey Decimal Classification (DDC)
Classical Indexing
The Natural Language problem:
• Low consistency:
—People use different words to refer to same things
—People use same words to refer to different things
• Objective in IR:
—Search & retrieval of documents (or records) requires some
level of intellectual control over the item and its contents
while, at the same time, recognizing the need for flexibility
Classical Indexing
• Keyword indexing (Google)
—Index entries generated
from the title and/or
keywords from the text.
—No intellectual process of
text analysis or abstraction
• Subject indexing (Yahoo)
—Involves analysis of the subject by humans /
computers
Classical Indexing Problems
• Effectiveness of indexing depends on:
—Indexing Exhaustiveness
◦ extent to which the subject matter of a given
document has been reflected through the index entries
—Term Specificity
◦ how broad/specific are the terms/keywords
Vocabulary Control: Controlled vs. Natural Language Indexing
• Controlled language
—Use of vocabulary control tool in indexing
—Semantic Web
—Dublin Core
—XML Ontologies
• Natural language (free text)
—Any term in the document may be an index term.
No mechanism controls the indexing process
—Modern Search Engine
Who wins?
A Modern IR System (A Search Engine)
• Crawler
• Indexer
• Searcher
Basic Concepts: Tokenization
• Assign a unique id to each word & keep it in a lexicon
• HTML tags can be
—Ignored, or
—Used to assign more weight to important items such as <title>
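A minimal sketch of the lexicon idea, assuming simple whitespace
tokenization and lowercasing (real tokenizers also handle
punctuation, numbers, and markup):

# Build a lexicon that assigns a stable integer id to each new token.
lexicon = {}

def token_ids(text):
    ids = []
    for token in text.lower().split():
        if token not in lexicon:
            lexicon[token] = len(lexicon)  # next unused id
        ids.append(lexicon[token])
    return ids

print(token_ids("the cat sat on the mat"))  # [0, 1, 2, 3, 0, 4]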
Basic concepts: Stop/Noise Words Removal
High-frequency words that carry little information, e.g., the,
a, an, of, and, to, is, …
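A quick illustrative filter; the stopword set below is a tiny
hand-picked sample, whereas production lists typically run to a
few hundred entries:

# Drop high-frequency function words before indexing.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "on", "to"}  # sample list

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("the cat sat on the mat".split()))
# ['cat', 'sat', 'mat']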
Basic concepts: Stemming (Roman Languages)
• Useful for
—Reducing # words (dimensionality)
—Machine translation
◦ Morphology: working, works, worked → work
• Performance improvement
Porter Stemmer
• Porter Stemmer (Porter 1980)
—http://www.tartarus.org/~martin/PorterStemmer/
—http://snowball.tartarus.org/
—Simple rules determine which affixes to strip, in which order,
and when to apply repair strategies, e.g.:

Input   Stripped “ed” affix   Repair
hoped   hop                   hope (add e if the word is short)
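The algorithm is easy to try out; here is a minimal demo using
NLTK’s implementation (this assumes the nltk package is
installed, which is not mentioned on the slide):

# Stem a few words with NLTK's Porter stemmer implementation.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["hoped", "working", "works", "worked", "consolidating"]:
    print(word, "->", stemmer.stem(word))
# hoped -> hope        (the repair step restores the final e)
# working/works/worked -> work
# consolidating -> consolid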
Porter Stemmer
consigned consign knack knack
Sample output: consignment consign knackeries knackeri
consolation consol knaves knavish
consolatory consolatori knavish knavish
consolidate consolid knif knif
consolidating consolid knife knife
Errors consoling consol knew knew
• Conflation:
—reply, rep. → rep
• Overstemming:
—wander → wand
—news → new
• Mis-stemming:
—relativity → relative
• Understemming:
—knavish → knavish (knave)
33
Stemmer vs. Dictionary
• Stemming rules are more efficient than a dictionary
—Algorithmic stemmers can be fast (and lean): 1 million words
in 6 seconds on a 500 MHz PC
• No maintenance even if things change
• Better to ignore irregular forms (exceptions) than to
complicate the algorithm
—not much lost in practice
—80/20 Rule
Basic Concepts: Phrase Detection
• Important for English
—New York City Police Department
—Bill Gates spoke on the benefits of Windows
• Essential for CJK (Chinese, Japanese, Korean)
—新加坡是个美丽的城市
[Singapore is a beautiful city]
• Approaches
—Dictionary Based
◦ Most Accurate; Needs maintenance (by humans)
—Learnt/Extracted from Corpus
◦ Hidden Markov Model; N-Grams; Statistical Analysis
◦ Suffix Tree Phrase Detection (via statistical counting)
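As a toy illustration of the corpus-statistics route, here is a
bigram-frequency sketch (the corpus and threshold are invented
for the example; real systems use the HMM, n-gram, or
suffix-tree methods listed above):

# Toy statistical phrase detection: keep bigrams that occur
# at least MIN_COUNT times in the corpus.
from collections import Counter

corpus = [
    "san francisco police officer",
    "san francisco bay area",
    "police officer in san francisco",
]

bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    bigrams.update(zip(tokens, tokens[1:]))

MIN_COUNT = 2  # hypothetical threshold
phrases = [" ".join(bg) for bg, n in bigrams.items() if n >= MIN_COUNT]
print(phrases)  # ['san francisco', 'police officer']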
Phrases Extracted from News Dataset
south san diego united high
south africa san diego ca united kingdom high density
south africa internet san francisco united nations high end
south africa po san francisco bay united nations quite high enough
south african san francisco chronicle united states high frequency
south african government san francisco giants united states attempt high hockey
south african intelligence san francisco police united states code high just
south african libertarian san francisco police inspector united states government high level
south america san francisco police inspector ron united states holocaust high performance
south atlantic san francisco police intelligence united states officially high power
south dakota san francisco police intelligence unit united states senate high quality
south dakota writes san francisco police officer high ranking
south georgia san jose high ranking crime
south georgia island san jose ca high ranking initiate
south pacific san jose mercury high resolution
south pacific island high school
high school students
high speed
high speed collision
high sticking
high tech
high voltage
highend
higher 36
Vector Space Text Representation
• Bag of Words (BOW) Model
—Order/Position of word/term unimportant
Basic Concepts: Weighing the Terms
• Which of these tells you more about a doc?
—10 occurrences of pizza?
—10 occurrences of the?
• Look at
—Collection frequency (CF): total occurrences of the term in
the collection
—Document frequency (DF): the number of documents the term
occurs in
• Which one is better?

Word        CF       DF
TRY         10 422   8 760
INSURANCE   10 440   3 997

• The CFs are nearly equal, but INSURANCE is concentrated in
far fewer documents, so DF better reflects how discriminative
a term is
Basic Concepts: TFIDF
• Assign a TFIDF weight to each term 𝑖 in each
document 𝑑
w_{i,d} = TF_{i,d} × log(N / DF_i)
—TF_{i,d} = frequency of term i in document d
—DF_i = number of documents containing term i
—N = total number of documents in the collection
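A minimal sketch of this weighting over a hypothetical
three-document corpus (whitespace tokenization and the natural
logarithm are simplifying assumptions):

# Compute TF-IDF weights: w[d][term] = tf * log(N / df).
import math
from collections import Counter

docs = {
    "d1": "pizza pizza pasta",
    "d2": "pasta menu",
    "d3": "pizza menu menu",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)

weights = {
    d: {term: freq * math.log(N / df[term]) for term, freq in counts.items()}
    for d, counts in tf.items()
}
print(round(weights["d1"]["pizza"], 3))  # 2 * log(3/2) ≈ 0.811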
Documents as Vectors
• Each document 𝑑 can now be viewed as a vector of
TF × IDF values, one component for each term
• So we have a vector space
—terms are axes
—docs live in this space
—even with stemming, may have 100,000+
dimensions
• The corpus of documents gives us a matrix, which
we could also view as a vector space in which words
live
Why turn documents into vectors?
• First application: Query-by-example
—Given a doc/query 𝑑, find others “like” it (or most
similar to it)
• Now that 𝑑 is a vector, find vectors (docs) “near” it.
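One common way to make “near” precise is cosine similarity
between the vectors; a minimal sketch with hypothetical TF-IDF
vectors (term → weight):

# Rank documents by cosine similarity to a query vector.
import math

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

docs = {  # hypothetical TF-IDF vectors
    "d1": {"pizza": 0.81, "pasta": 0.41},
    "d2": {"pasta": 0.41, "menu": 0.41},
    "d3": {"pizza": 0.41, "menu": 0.81},
}
query = {"pizza": 1.0}

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # ['d1', 'd3', 'd2']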
What is not IR
• Why not a SQL query?
—IR ≠ DB query
—IR ≠ XML
• Data → Query
—IR = unstructured
—XML = semi-structured
—DB = structured
Search Engine Evolution
• 1st generation (1995–1997): uses only “on page” data
—Text data: word frequency, language
—Examples: AltaVista, Excite, Lycos, …
2nd Generation Search Engine
• Ranking — use off-page, web-specific data
—Link (or connectivity) analysis
—Click-through data (what results people click on)
—Anchor-text (how people refer to this page)
• Link Analysis
—Idea: mine hyperlink information on the Internet
—Assumptions:
◦ Links often connect related pages
◦ A link between pages is a recommendation
▪ “people vote with their links”
3rd Generation Search Engine
• Query language determination
—if the query is in Japanese, do not return English pages
• Different ranking
• Hard & soft matches
—Personalities (triggered on names)
—Cities (travel info, maps)
—Medical info (triggered on names and/or results)
—Stock quotes, news (triggered on stock symbol)
—Company info, …
• Integration of Search and Text Analysis
3rd Generation Search Engine (cont’d)
• Context determination
—where: spatial (user location/target location)
—when: query stream (previous queries)
—who: personalization (user profile)
—explicit (family friendly)
—implicit (use google.com.sg or google.fr)
• Context use
—Result restriction
—Ranking modulation
History: Google Architecture (circa 1998)
Implemented in C/C++ on Linux and off-the-shelf PCs
For Comparison, Google Today
Google Architecture (c. 1998): Crawler
Google Architecture (c. 1998): Indexer
Google Architecture (c. 1998): Searcher
Google’s Algorithm
• Imagine a browser randomly walking on the Internet:
—Start at a random page
—At each step, follow one of the page’s out-links, chosen
uniformly at random
[Figure: a page with three out-links; each is followed with
probability 1/3]
Google Teleporting
• At each visit, the surfer may also teleport: with some
probability it jumps to a page chosen uniformly at random from
the whole collection
[Figure: ten pages; each is the target of a teleport with
probability 1/10]
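Putting the two rules together gives PageRank. Below is a
minimal power-iteration sketch; the three-page link graph and
the damping factor of 0.85 are assumptions for illustration,
not Google’s actual data:

# PageRank by power iteration: repeatedly redistribute rank along
# out-links, mixing in a uniform "teleport" component.
damping = 0.85  # probability of following a link vs. teleporting

links = {  # hypothetical toy web: page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):
    new = {p: (1 - damping) / len(pages) for p in pages}
    for page, outs in links.items():
        for target in outs:
            new[target] += damping * rank[page] / len(outs)
    rank = new

print({p: round(rank[p], 3) for p in pages})
# {'a': 0.388, 'b': 0.215, 'c': 0.397}  -- ranks sum to ~1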
Anchor Text
• Associate anchor text of a link to the page it points
to
• Advantages:
—Links provide more accurate description
—Can index documents that text-based search engines
cannot (e.g., Google Image Search)
Key Google Optimization Techniques
• Each crawler maintains its local DNS lookup cache
• Parallelization of indexing phase
• In-memory lexicon
• Compression of repository
• Compact encoding of hitlists accounting for major space
savings
• Indexer is optimized so it is just faster than the crawler
(bottleneck)
• Document index updated in bulk
• Critical data structures placed on local disk
• Overall architecture designed to avoid disk seeks
wherever possible
Google Storage Requirements (c. 1998)
• Total disk space: approx. 106 GB
—a standard PC hard disk was approx. 10 GB in 1998

Price of 1 GB over time:
1981   $ 300 000
1987   $ 50 000
1990   $ 10 000
1994   $ 1 000
1997   $ 100
2000   $ 10
2004   $ 1
2012   $ 0.1
2017   $ 0.03