Lecture 1 - Introduction

The document outlines a lecture on Information Retrieval (IR) led by Dr. Vasily Sidorov, covering course information, components of modern IR systems, and the evolution of search engines. It discusses prerequisites, learning outcomes, and the distinction between IR and text mining, while also highlighting challenges and research areas in text mining. Additionally, it introduces classical indexing methods, vocabulary control, and basic concepts such as tokenization and TF-IDF in the context of search engines.


CI-6226

Lecture 1. Introduction
Information Retrieval and Analysis

Vasily Sidorov
Today’s Outline
• Course information
—Why learn IR?
• Walkthrough of the components in a modern IR
system
• Search engine evolution
• Overview of Google search engine (1998)
—Architecture
—PageRank
Course Instructor (me)

• Dr. Vasily Sidorov


• SCSE, NTU
• Questions, feedback, anything
[email protected]

4
Textbook
• Introduction to Information
Retrieval
—Christopher D. Manning
—Prabhakar Raghavan
—Hinrich Schütze
• Available online at:
—http://nlp.stanford.edu/IR-book/

5
Reference Books
• Modern Information Retrieval: The Concepts and Technology behind Search (2nd Edition)
• Information Retrieval: Implementing and Evaluating Search Engines
• Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition
• Search Engines: Information Retrieval in Practice

6
Pre-requisites
• Mathematics
—Linear algebra: vectors, matrices, matrix inverse
—Probability theory basics
◦ e.g., P(A∩B) = P(A)P(B) when A and B are independent
• Hands-on Attitude
—Embrace new technology and ideas
• Programming
—Any programming language will do
◦ Python, Java, C#, C/C++, JS …
—Work with files
—Text encodings (ASCII, UTF, …)
—Algorithms
—Data structures

7
After taking this course, you’ll know
• How to build your own search engine, or customize
an existing text search engine
• How to enhance applications using IR, e.g.,
—Cluster text-like information such as microarray data
—Find similar actions / data / objects
—Parse/analyze text/dialogues (e.g., Facebook posts,
Twitter, comments)
• How to build your own NextGen IR killer-app
—e.g., matching people based on their preferences
—limited only by your imagination!

8
This course does NOT cover
• Non-text data Retrieval
—Image
—Video
• XML Retrieval and NoSQL databases
• Natural Language Processing
—Ontologies, e.g., WORDNET, HOWNET
—Part of Speech Tagging, Grammar, Parsing, …
—GPT-3 and other generative text models
• Structured Data Retrieval
—SQL

9
Let’s start
• What is IR?
—IR = Integrated Resort? Internet Relay?
—IR = Information Retrieval
• What to retrieve?
—bookmarks like del.icio.us
—people, like LinkedIn, Facebook
—books (in library or on Amazon)
—text (web pages, medical reports, assignment
reports)
—images (photos, Flickr)
—videos (movies, YouTube)
• IR vs. Text Mining

10
What is Text Mining?
• “The objective of Text Mining is to exploit information
contained in textual documents in various ways,
including …discovery of patterns and trends in data,
associations among entities, predictive rules, etc.”
— Grobelnik et al., 2001

• “Another way to view text data mining is as a process of


exploratory data analysis that leads to heretofore
unknown information, or to answers for questions for
which the answer is not currently known”
— Hearst, 1999

11
Text Mining Challenges
• Data collection is “free text”
—Not well-organized; Semi-structured
or unstructured
• Natural language text contains
ambiguities on many levels
—Lexical, syntactic, semantic, and
pragmatic, e.g.,
Time flies like an arrow.
Fruit flies like a banana.
• Learning techniques for processing text typically
need annotated training examples

12
Text Mining Research Areas
• Information Retrieval (IR)
—Search Engines
—Classification
—Recommendation

• Information Extraction (IE)


—Product Information (e.g. price) scraping
—Named entity recognition

• Information Understanding
—Natural Language Processing (NLP)
—Question Answering
—Concept Extraction from Newsgroup
—Visualization, Summarization

• Cross-Lingual Text Mining

• Trend Detection
—Outlier Detection
—Event Detection

13
Why Learn IR?
• Understand limitations of state-of-the-art IR
—Learn what is possible in IR, tell Fiction from Fact
—Learn how to fool IR systems?
• Organize your personal information
—Master/create IR software to manage personal
information
• How to use Search Engines better
• Design next generation IR system!
—Be the next Google (not necessarily a search engine)
—Yahoo, Google, [Your Company]

14
How to Retrieve Information
• Example
—Scan through every book in library/store bookshelf
—View every image/video
• To speed up IR:
—Scan every piece of information once, in advance of any
retrieving
◦ Google/Bing/etc. try to download the entire Web
—Indexing = scan everything and remember where each
piece of information is located
◦ “1984” located at Level 2 Shelf 34 of National Library
◦ List of documents containing “1984” stored on disk C:\

15
History
• 300 BC, Great Library of Alexandria, Egypt
—Most books were stored in armaria (closed, labelled
cupboards). Armaria were used for book storage till
medieval times.

16
Libraries Before Computers
• Cataloging
—A process of describing a document (both physical
attributes & subject contents)
—Catalog = key to a collection
• Bibliographic record
—A surrogate record produced by catalogers according
to defined standards (e.g., Machine Readable
Cataloging record)
• Subject Classification
—Allocating a classification number

17
Classical Indexing
• Indexing
—Human librarians construct document surrogates by
assigning identifiers to text items.
• Includes
—Keyword Indexing
◦ Similar to today's Search Engine Index
—Subject Indexing
◦ Similar to today's Classification Engine

18
Subject Indexing - Classification
• Hierarchical structure
—Similar subjects at the same level
◦ e.g., Furniture → Chairs, Tables
• Goals of Classification
—Collocate subjects
◦ group all documents of same subject together on
shelves & put them next to related subjects.
—Define & assign a code (Call Number) to each document
◦ to facilitate identification in the catalogue and
location on the shelf

19
Dewey Decimal Classification (DDC)

• Most widely used


—Used by > 135 countries
• Translated into more than 30 languages
—Arabic, Chinese, French, Greek, Hebrew, Icelandic,
Russian, Spanish
• Universe of knowledge divided into 10 main classes
—Each class divided into 10 main divisions, …
—until all disciplines, subjects and concepts are
defined.
• Currently: 23rd edition (2011)
20
DDC Example
000 Computer science, information & general works
100 Philosophy & psychology
200 Religion
300 Social sciences
400 Language
500 Science
600 Technology
700 Arts & recreation
800 Literature
900 History & geography

600 Technology (applied sciences)
610 Medical sciences
620 Engineering and allied operations
630 Agriculture and related technologies
640 Home economics and family living
650 Management and auxiliary services
660 Chemical engineering and related technologies
670 Manufactures
680 Manufacture of products for specific uses
690 Buildings

620 Engineering & allied operations
621 Applied physics
622 Mining and related operations
623 Military and nautical engineering
624 Civil engineering
625 Engineering of railroads and roads
626 [not used]
627 Hydraulic engineering
628 Sanitary and municipal engineering
629 Other branches of engineering

Another example:
500 Natural sciences and mathematics
510 Mathematics
516 Geometry
516.3 Analytic geometries
516.37 Metric differential geometries
516.375 Finsler geometry

21
DDC is not ideal…
• DDC Classification Guidelines
—Determine the subject of a work
—Determine the disciplinary focus of a work
—Refer to the schedules
• Rules to handle a document in multiple classes
—First-of-two Rule: When two subjects receive equal
treatment, classify the work with the subject whose
number comes first in the schedules
—Rule of Application: Classify a work dealing with
interrelated subjects with the subject that is acted
upon

22
Classical Indexing
The Natural Language problem:
• Low consistency:
—People use different words to refer to same things
—People use same words to refer to different things
• Objective in IR:
—Search & retrieval of documents (or records) require
some level of intellectual control over the item and
its contents, at the same time, recognizing the need
for flexibility

23
Classical Indexing
• Keyword indexing (Google)
—Index entries generated
from the title and/or
keywords from the text.
—No intellectual process of
text analysis or abstraction
• Subject indexing (Yahoo)
—Involves analysis of the subject by humans /
computers

24
Classical Indexing Problems
• Effectiveness of indexing depends on:
—Indexing Exhaustiveness
◦ extent to which the subject matter of a given
document has been reflected through the index entries
—Term Specificity
◦ how broad/specific are the terms/keywords

25
Vocabulary Control: Controlled vs Natural
language indexing
• Controlled language
—Use of vocabulary control tool in indexing
—Semantic Web
—Dublin Core
—XML Ontologies
• Natural language (free text)
—Any term in the document may be an index term.
No mechanism controls the indexing process
—Modern Search Engine

26
Who wins?

27
A Modern IR System (A Search Engine)
• Crawler
• Indexer
• Searcher

28
Basic Concepts: Tokenization
• Assign unique id to each word & keep in a lexicon
• HTML tags can be
—Ignored or
—Used to assign more weight to important items such
as <title>

• Remove Stop/Noise words before/after tokenization

29
Basic concepts: Stop/Noise Words
Removal
High-frequency words that carry little information, e.g.:

a about all also among an and are as at be been between
both but by click did do during each either found from
further get had

(and the same in other languages)
30
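The two steps above, tokenization with a lexicon of word ids and stop-word removal, can be sketched as follows. This is a minimal illustration; the stop list is a tiny made-up subset, not a standard one:

```python
# Minimal sketch: tokenize text, drop stop words, and assign each
# remaining word a unique id kept in a lexicon (word -> id).
import re

STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in"}  # illustrative subset

def tokenize(text, lexicon):
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # split on non-alphanumerics
    ids = []
    for tok in tokens:
        if tok in STOP_WORDS:                         # stop-word removal
            continue
        if tok not in lexicon:                        # new word: assign next id
            lexicon[tok] = len(lexicon)
        ids.append(lexicon[tok])
    return ids

lexicon = {}
ids = tokenize("The title of the page and the body of the page", lexicon)
```

In a real engine the lexicon would be persisted, and HTML tags such as <title> would be used to weight tokens rather than discarded.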
Basic concepts: Stemming (Romance
Languages)
• Useful for
—Reducing # of words (dimensionality)
—Machine Translation
◦ Morphology: working, works, worked → work
• Performance Improvement
—No: Harman (1991), Krovetz (1993)
—Possibly: Krovetz (1995)
◦ Depends on the type of text; the assumption is that
once one moves beyond English, the difference will
prove significant
• Stemmers
—Porter Stemmer, k-stemmer

31
Porter Stemmer
• Porter Stemmer (Porter 1980)
—http://www.tartarus.org/~martin/PorterStemmer/
—http://snowball.tartarus.org/
—Simple algorithms to determine which affixes to
strip in which order and when to apply repair
strategies
Input Stripped “ed” affix Repair
hoped hop hope (add e if word is short)

hopped hopp hop (delete one if doubled)

• Samples of the algorithm accessible on the Internet


• Easy to understand and program

32
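The "ed"-stripping row of the table above can be sketched as a toy rule. This is only a fragment of the real Porter algorithm; the repair conditions below are simplified stand-ins for Porter's measure-based rules:

```python
# Toy "-ed" stripping with the two repair strategies from the table:
# undouble a final consonant (hopped -> hopp -> hop) and add back 'e'
# for short stems (hoped -> hop -> hope). Not the full Porter algorithm.
def strip_ed(word):
    if not word.endswith("ed") or len(word) <= 3:
        return word
    stem = word[:-2]
    if len(stem) >= 2 and stem[-1] == stem[-2]:   # doubled letter: delete one
        return stem[:-1]
    if len(stem) <= 3:                            # short word: restore 'e'
        return stem + "e"
    return stem
```

E.g. strip_ed("hoped") gives "hope", strip_ed("hopped") gives "hop", and longer stems like "consigned" are simply truncated to "consign".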
Porter Stemmer
Sample output:
consigned → consign          knack → knack
consignment → consign        knackeries → knackeri
consolation → consol         knaves → knave
consolatory → consolatori    knavish → knavish
consolidate → consolid       knif → knif
consolidating → consolid     knife → knife
consoling → consol           knew → knew
Errors
• Conflation:
—reply, rep. → rep
• Overstemming:
—wander → wand
—news → new
• Mis-stemming:
—relativity → relative
• Understemming:
—knavish → knavish (knave)

33
Stemmer vs. Dictionary
• Stemming Rules more efficient than a dictionary
—Algorithmic stemmers can be fast (and lean): 1
Million words in 6 seconds on a 500 MHz PC
• No maintenance even if things change
• Better to ignore irregular forms (exceptions) than to
complicate the algorithm
—not much lost in practice
—80/20 Rule

34
Basic Concepts: Phrase Detection
• Important for English
—New York City Police Department
—Bill Gates spoke on the benefits of Windows
• Essential for CJK (Chinese, Japanese, Korean)
—新加坡是个美丽的城市
[Singapore is a beautiful city]
• Approaches
—Dictionary Based
◦ Most Accurate; Needs maintenance (by humans)
—Learnt/Extracted from Corpus
◦ Hidden Markov Model; N-Grams; Statistical Analysis
◦ Suffix Tree Phrase Detection (via statistical counting)

35
Phrases Extracted from News Dataset

south: south africa, south africa internet, south africa po, south african, south african government, south african intelligence, south african libertarian, south america, south atlantic, south dakota, south dakota writes, south georgia, south georgia island, south pacific, south pacific island

san: san diego, san diego ca, san francisco, san francisco bay, san francisco chronicle, san francisco giants, san francisco police, san francisco police inspector, san francisco police inspector ron, san francisco police intelligence, san francisco police intelligence unit, san francisco police officer, san jose, san jose ca, san jose mercury

united: united kingdom, united nations, united nations quite, united states, united states attempt, united states code, united states government, united states holocaust, united states officially, united states senate

high: high density, high end, high enough, high frequency, high hockey, high just, high level, high performance, high power, high quality, high ranking, high ranking crime, high ranking initiate, high resolution, high school, high school students, high speed, high speed collision, high sticking, high tech, high voltage, highend, higher

36
Vector Space Text Representation
• Bag of Words (BOW) Model
—Order/Position of word/term is unimportant

• Term Frequency (TF)
—Terms that appear more frequently in a document are
considered more important in representing that
document

37
Basic Concepts: Weighing the Terms
• Which of these tells you more about a doc?
—10 occurrences of pizza?
—10 occurrences of the?
• Look at
—Collection frequency (CF)
—Document frequency (DF), which is better:
Word CF DF
TRY 10 422 8 760
INSURANCE 10 440 3 997

• CF/DF weighting is possible only in a known collection


38
Basic Concepts: TFIDF
• TF × IDF measure combines:
—Term frequency (TF): measure of term density in a doc
—Inverse document frequency (IDF): measure of
informativeness of a term, its rarity across the whole corpus
—Could be based on the raw count of documents containing the term:

IDF_i = 1 / DF_i

• By far the most commonly used version is:

IDF_i = log(N / DF_i)

See Kishore Papineni, NAACL 2, 2002 for theoretical justification

39
Basic Concepts: TFIDF
• Assign a TFIDF weight to each term i in each
document d:

w_(i,d) = TF_(i,d) × log(N / DF_i)

• A word is more important if
—The word appears more often in the document
—The word does not appear that often in the whole
corpus

40
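The weighting above, w_(i,d) = TF_(i,d) × log(N/DF_i), can be sketched on a toy corpus. The documents and terms are made up for illustration; natural log is used here, though the base is a free choice:

```python
# TF-IDF sketch: w(i,d) = TF(i,d) * log(N / DF(i)).
import math
from collections import Counter

docs = [
    ["the", "pizza", "pizza", "oven"],
    ["the", "pizza", "place"],
    ["the", "oven", "mitt"],
]
N = len(docs)
df = Counter()
for doc in docs:
    df.update(set(doc))                     # DF: number of docs containing term

def tfidf(term, doc):
    tf = doc.count(term)
    return tf * math.log(N / df[term])      # rarer terms get higher IDF

w_the = tfidf("the", docs[0])               # "the" is in every doc -> weight 0
w_pizza = tfidf("pizza", docs[0])           # frequent in doc, rare in corpus
```

Note how "the", despite being frequent in the document, gets weight 0 because it appears in every document (log(N/N) = 0), matching the pizza-vs-the intuition on the previous slide.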
Documents as Vectors
• Each document 𝑑 can now be viewed as a vector of
TF × IDF values, one component for each term
• So we have a vector space
—terms are axes
—docs live in this space
—even with stemming, may have 100,000+
dimensions
• The corpus of documents gives us a matrix, which
we could also view as a vector space in which words
live

41
Why turn documents into vectors
• First application: Query-by-example
—Given a doc/query 𝑑, find others “like” it (or most
similar to it)
• Now that 𝑑 is a vector, find vectors (docs) “near” it.

42
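Finding vectors "near" a query vector is typically done with cosine similarity. A minimal sketch over sparse vectors stored as {term: weight} dicts, with made-up weights (any weighting, e.g. TF-IDF, would do):

```python
# Cosine similarity between sparse vectors stored as {term: weight} dicts.
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {
    "d1": {"pizza": 2.0, "oven": 1.0},
    "d2": {"pizza": 1.0, "place": 1.0},
    "d3": {"stock": 3.0, "quote": 1.0},
}
query = {"pizza": 1.0}

# Query-by-example: rank all documents by similarity to the query vector
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

Cosine is preferred over Euclidean distance here because it is insensitive to document length: only the direction of the vector matters.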
What is not IR
• Why not a SQL query?
—IR ≠ DB query
—IR ≠ XML

• Data → Query
—IR = unstructured
—XML = semi-structured
—DB = structured

43
Search Engine Evolution
• 1st generation (uses only “on page” data), 1995–1997
—Text data, word frequency, language
—AltaVista, Excite, Lycos, …

• 2nd generation (uses off-page, web-specific data), 1998–present
—Link (or connectivity) analysis
—Click-through data (what people click)
—Anchor-text (how people refer to this page)
—Made popular by Google, used by everyone now

• 3rd generation (answers “the need behind the query”), present day
—Semantic analysis: what is this about?
—Focus on user need, rather than on query
—Context determination
—Siri, Alexa, Cortana, Google Assistant, IBM Watson
44
1st Generation Search Engine
• Extended Boolean model
—Matches: exact, prefix, phrase, …
—Operators: AND, OR, NOT, NEAR, …
—Fields: TITLE:, URL:, HOST:, …
—AND is somewhat easier to implement, maybe
preferable as default for short queries
• Ranking
—TF-like factors: TF, explicit keywords, words in title,
explicit emphasis (headers), etc.
—IDF factors: IDF, total word count in corpus,
frequency in query log, frequency in language

45
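The Boolean AND matching above can be sketched with an inverted index mapping each term to the set of documents containing it; AND is then just intersection of posting lists. The toy documents are made up for illustration:

```python
# Boolean-model sketch: inverted index + AND as posting-list intersection.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)            # posting list as a set of ids
    return index

def boolean_and(index, *terms):
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {1: "new home sales", 2: "home sales rise", 3: "new rise"}
index = build_index(docs)
hits = boolean_and(index, "home", "sales")
```

Production systems keep postings as sorted lists (not sets) so that intersection can be done with a linear merge and skip pointers.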
2nd Generation Search Engine
• Ranking — use off-page, web-specific data
—Link (or connectivity) analysis
—Click-through data (what results people click on)
—Anchor-text (how people refer to this page)

• Link Analysis
—Idea: mine hyperlink information on the Internet
—Assumptions:
◦ Links often connect related pages
◦ A link between pages is a recommendation
▪ “people vote with their links”

46
3rd Generation Search Engine
• Query language determination
—if the query is in Japanese, do not return English pages
• Different ranking
• Hard & soft matches
—Personalities (triggered on names)
—Cities (travel info, maps)
—Medical info (triggered on names and/or results)
—Stock quotes, news (triggered on stock symbol)
—Company info, …
• Integration of Search and Text Analysis

47
3rd Generation Search Engine (cont’d)
• Context determination
—where: spatial (user location/target location)
—when: query stream (previous queries)
—who: personalization (user profile)
—explicit (family friendly)
—implicit (use google.com.sg or google.fr)

• Context use
—Result restriction
—Ranking modulation

48
History: Google Architecture (circa 1998)
Implemented in C/C++ on Linux and off-the-shelf PCs

49
For Comparison, Google Today

50
Google Architecture (c. 1998): Crawler

51
Google Architecture (c. 1998): Indexer

52
Google Architecture (c. 1998): Searcher

53
Google’s Algorithm
• Imagine a browser randomly walking on the
Internet:
—Start at a random page
—At each step, go out of the current page along one
of the links on that page, equiprobably (e.g., a page
with three out-links follows each with probability 1/3)
—“In the steady state” each page has a long-term visit
rate - use this as the page’s score
• BUT: The web is full of dead-ends
—Random walk can get stuck in dead-ends
—No sense to talk about long-term visit rates

54
Google Teleporting
• At each visit,
—with probability 10%, jump to a random web page
(e.g., with ten pages, each is chosen with probability 1/10)
—with remaining probability (90%), go out on a
random link
—If no out-link, stay put in this case
• Now cannot get stuck locally
—There is a long-term rate at which any page is visited
• Motivation:
—Pages cited from many places are IMPORTANT
—Pages cited from an important place are
IMPORTANT.

55
Anchor Text
• Associate anchor text of a link to the page it points
to
• Advantages:
—Links provide more accurate description
—Can index documents that text-based search engine
cannot (e.g. Google Image Search)

56
Key Google Optimization Techniques
• Each crawler maintains its local DNS lookup cache
• Parallelization of indexing phase
• In-memory lexicon
• Compression of repository
• Compact encoding of hitlists accounting for major space
savings
• Indexer is optimized so it is just faster than the crawler
(bottleneck)
• Document index updated in bulk
• Critical data structures placed on local disk
• Overall architecture designed to avoid disk seeks
wherever possible
57
Google Storage Requirements (c. 1998)
• Total disk space: approx. 106 GB
—a standard PC hard disk was approx. 10 GB in 1998

Price of 1 GB
1981 $ 300 000
1987 $ 50 000
1990 $ 10 000
1994 $ 1 000
1997 $ 100
2000 $ 10
2004 $ 1
2012 $ 0.1
2017 $ 0.03

58
