Lecture 1. Introduction
Information Retrieval and Analysis
Vasily Sidorov
Today’s Outline
• Course information
—Why learn IR?
• Walkthrough of the components in a modern IR
system
• Search engine evolution
• Overview of Google search engine (1998)
—Architecture
—PageRank
Course Instructor (me)
Textbook
• Introduction to Information
Retrieval
—Christopher D. Manning
—Prabhakar Raghavan
—Hinrich Schütze
• Available online at:
—http://nlp.stanford.edu/IR-book/
Reference Books
• Modern Information Retrieval: The Concepts and Technology
behind Search (2nd Edition)
• Information Retrieval: Implementing and Evaluating Search
Engines
Pre-requisites
• Mathematics
—Linear algebra: vectors, matrices, matrix inverse
—Probability theory basics
◦ P(A ∩ B) = P(A)·P(B) for independent events A and B
• Hands-on Attitude
—Embrace new technology and ideas
• Programming
—Any programming language will do
◦ Python, Java, C#, C/C++, JS …
—Work with files
—Text encodings (ASCII, UTF-8, …)
—Algorithms
—Data structures
After taking this course, you’ll know
• How to build your own search engine, or customize
an existing text search engine
• How to enhance applications using IR, e.g.,
—Cluster text-like information such as microarray data
—Find similar actions / data / objects
—Parse/analyze text/dialogues (e.g., Facebook posts,
Twitter, comments)
• How to build your own NextGen IR killer-app
—e.g., matching people based on their preferences
—limited only by your imagination!
This course does NOT cover
• Non-text data Retrieval
—Image
—Video
• XML Retrieval and NoSQL databases
• Natural Language Processing
—Ontologies, e.g., WordNet, HowNet
—Part-of-Speech Tagging, Grammar, Parsing, …
—GPT-3 and other generative text models
• Structured Data Retrieval
—SQL
Let’s start
• What is IR?
—IR = Integrated Resort? Internet Relay?
—IR = Information Retrieval
• What to retrieve?
—bookmarks like del.icio.us
—people, like LinkedIn, Facebook
—books (in library or on Amazon)
—text (web pages, medical reports, assignment
reports)
—images (photos, Flickr)
—videos (movies, YouTube)
• IR vs. Text Mining
What is Text Mining?
• “The objective of Text Mining is to exploit information
contained in textual documents in various ways,
including …discovery of patterns and trends in data,
associations among entities, predictive rules, etc.”
— Grobelnik et al., 2001
Text Mining Challenges
• Data collection is “free text”
—Not well-organized; Semi-structured
or unstructured
• Natural language text contains
ambiguities on many levels
—Lexical, syntactic, semantic, and
pragmatic, e.g.,
Time flies like an arrow.
Fruit flies like a banana.
• Learning techniques for processing text typically
need annotated training examples
Text Mining Research Areas
• Information Retrieval (IR)
—Search Engines
—Classification
—Recommendation
• Information Understanding
—Natural Language Processing (NLP)
—Question Answering
—Concept Extraction from Newsgroups
—Visualization, Summarization
• Trend Detection
—Outlier Detection
—Event Detection
Why Learn IR?
• Understand limitations of state-of-the-art IR
—Learn what is possible in IR; tell fact from fiction
—Learn how to fool IR systems?
• Organize your personal information
—Master/create IR software to manage personal
information
• How to use Search Engines better
• Design next generation IR system!
—Be the next Google (not necessarily a search engine)
—Yahoo, Google, [Your Company]
How to Retrieve Information
• Example
—Scan through every book on a library/bookstore shelf
—View every image/video
• To speed up IR:
—Scan every piece of information once, before retrieval time
◦ Google/Bing/etc. try to download the entire Web
—Indexing = scan everything = remember where each piece of
information is located (see the sketch below)
◦ “1984” is located at Level 2, Shelf 34 of the National Library
◦ the list of documents containing “1984” is stored on disk C:\
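A minimal sketch of the idea in Python: an inverted index over a
hypothetical three-document collection, in the spirit of the
“list of documents containing 1984” example above.

# Minimal inverted-index sketch: map each term to the set of
# documents containing it, so lookup avoids rescanning everything.
from collections import defaultdict

docs = {  # hypothetical toy collection
    1: "1984 is a novel by George Orwell",
    2: "the national library stores many novels",
    3: "1984 is shelved at the national library",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["1984"]))     # [1, 3]
print(sorted(index["library"]))  # [2, 3]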
History
• 300 BC, Great Library of Alexandria, Egypt
—Most books were stored in armaria (closed, labelled
cupboards). Armaria were used for book storage till
medieval times.
Libraries Before Computers
• Cataloging
—A process of describing a document (both physical
attributes & subject contents)
—Catalog = key to a collection
• Bibliographic record
—A surrogate record produced by catalogers according
to defined standards (e.g., Machine Readable
Cataloging record)
• Subject Classification
—Allocating a classification number
Classical Indexing
• Indexing
—Human librarians construct document surrogates by
assigning identifiers to text items.
• Includes
—Keyword Indexing
◦ Similar to today's Search Engine Index
—Subject Indexing
◦ Similar to today's Classification Engine
Subject Indexing - Classification
• Hierarchical structure
—Similar subjects at the same level
◦ e.g., Furniture → Chairs, Tables
• Goals of Classification
—Collocate subjects
◦ group all documents on the same subject together on the
shelves & put them next to related subjects
—Define & assign a code (Call Number) to each document
◦ to facilitate identification in the catalogue and location
on the shelf
Dewey Decimal Classification (DDC)
Classical Indexing
The Natural Language problem:
• Low consistency:
—People use different words to refer to same things
—People use same words to refer to different things
• Objective in IR:
—Search & retrieval of documents (or records) requires some
level of intellectual control over the item and its contents
while, at the same time, recognizing the need for flexibility
Classical Indexing
• Keyword indexing (Google)
—Index entries generated
from the title and/or
keywords from the text.
—No intellectual process of
text analysis or abstraction
• Subject indexing (Yahoo)
—Involves analysis of the subject by humans /
computers
Classical Indexing Problems
• Effectiveness of indexing depends on:
—Indexing Exhaustiveness
◦ extent to which the subject matter of a given
document has been reflected through the index entries
—Term Specificity
◦ how broad/specific are the terms/keywords
Vocabulary Control: Controlled vs. Natural Language Indexing
• Controlled language
—Use of vocabulary control tool in indexing
—Semantic Web
—Dublin Core
—XML Ontologies
• Natural language (free text)
—Any term in the document may be an index term.
No mechanism controls the indexing process
—Modern Search Engine
Who wins?
A Modern IR System (A Search Engine)
• Crawler
• Indexer
• Searcher
Basic Concepts: Tokenization
• Assign a unique id to each word & keep it in a lexicon
• HTML tags can be
—Ignored, or
—Used to assign more weight to important items such as <title>
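A minimal sketch of the lexicon idea, assuming simple whitespace
tokenization and lowercasing (real tokenizers also handle
punctuation, numbers, and markup):

# Build a lexicon that assigns a stable integer id to each new token.
lexicon = {}

def token_ids(text):
    ids = []
    for token in text.lower().split():
        if token not in lexicon:
            lexicon[token] = len(lexicon)  # next unused id
        ids.append(lexicon[token])
    return ids

print(token_ids("the cat sat on the mat"))  # [0, 1, 2, 3, 0, 4]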
Basic concepts: Stop/Noise Words Removal
High-frequency words that carry little information, e.g., the,
a, an, of, and, to, is, …
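A quick illustrative filter; the stopword set below is a tiny
hand-picked sample, whereas production lists typically run to a
few hundred entries:

# Drop high-frequency function words before indexing.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "on", "to"}  # sample list

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("the cat sat on the mat".split()))
# ['cat', 'sat', 'mat']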
Basic concepts: Stemming (Roman Languages)
• Useful for
—Reducing # words (dimensionality)
—Machine translation
◦ Morphology: working, works, worked → work
• Performance improvement
Porter Stemmer
• Porter Stemmer (Porter 1980)
—http://www.tartarus.org/~martin/PorterStemmer/
—http://snowball.tartarus.org/
—Simple rules determine which affixes to strip, in which order,
and when to apply repair strategies, e.g.:

Input   Stripped “ed” affix   Repair
hoped   hop                   hope (add e if the word is short)
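The algorithm is easy to try out; here is a minimal demo using
NLTK’s implementation (this assumes the nltk package is
installed, which is not mentioned on the slide):

# Stem a few words with NLTK's Porter stemmer implementation.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["hoped", "working", "works", "worked", "consolidating"]:
    print(word, "->", stemmer.stem(word))
# hoped -> hope        (the repair step restores the final e)
# working/works/worked -> work
# consolidating -> consolid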
Porter Stemmer
consigned consign knack knack
Sample output: consignment consign knackeries knackeri
consolation consol knaves knavish
consolatory consolatori knavish knavish
consolidate consolid knif knif
consolidating consolid knife knife
Errors consoling consol knew knew
• Conflation:
—reply, rep. → rep
• Overstemming:
—wander → wand
—news → new
• Mis-stemming:
—relativity → relative
• Understemming:
—knavish → knavish (knave)
33
Stemmer vs. Dictionary
• Stemming rules are more efficient than a dictionary
—Algorithmic stemmers can be fast (and lean): 1 million words
in 6 seconds on a 500 MHz PC
• No maintenance even if things change
• Better to ignore irregular forms (exceptions) than to
complicate the algorithm
—not much lost in practice
—80/20 Rule
Basic Concepts: Phrase Detection
• Important for English
—New York City Police Department
—Bill Gates spoke on the benefits of Windows
• Essential for CJK (Chinese, Japanese, Korean)
—新加坡是个美丽的城市
[Singapore is a beautiful city]
• Approaches
—Dictionary Based
◦ Most Accurate; Needs maintenance (by humans)
—Learnt/Extracted from Corpus
◦ Hidden Markov Model; N-Grams; Statistical Analysis
◦ Suffix Tree Phrase Detection (via statistical counting)
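As a toy illustration of the corpus-statistics route, here is a
bigram-frequency sketch (the corpus and threshold are invented
for the example; real systems use the HMM, n-gram, or
suffix-tree methods listed above):

# Toy statistical phrase detection: keep bigrams that occur
# at least MIN_COUNT times in the corpus.
from collections import Counter

corpus = [
    "san francisco police officer",
    "san francisco bay area",
    "police officer in san francisco",
]

bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    bigrams.update(zip(tokens, tokens[1:]))

MIN_COUNT = 2  # hypothetical threshold
phrases = [" ".join(bg) for bg, n in bigrams.items() if n >= MIN_COUNT]
print(phrases)  # ['san francisco', 'police officer']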
Phrases Extracted from News Dataset
south san diego united high
south africa san diego ca united kingdom high density
south africa internet san francisco united nations high end
south africa po san francisco bay united nations quite high enough
south african san francisco chronicle united states high frequency
south african government san francisco giants united states attempt high hockey
south african intelligence san francisco police united states code high just
south african libertarian san francisco police inspector united states government high level
south america san francisco police inspector ron united states holocaust high performance
south atlantic san francisco police intelligence united states officially high power
south dakota san francisco police intelligence unit united states senate high quality
south dakota writes san francisco police officer high ranking
south georgia san jose high ranking crime
south georgia island san jose ca high ranking initiate
south pacific san jose mercury high resolution
south pacific island high school
high school students
high speed
high speed collision
high sticking
high tech
high voltage
highend
higher 36
Vector Space Text Representation
• Bag of Words (BOW) Model
—Order/Position of word/term unimportant
Basic Concepts: Weighing the Terms
• Which of these tells you more about a doc?
—10 occurrences of pizza?
—10 occurrences of the?
• Look at
—Collection frequency (CF): total occurrences of the term in
the collection
—Document frequency (DF): the number of documents the term
occurs in
• Which one is better?

Word        CF       DF
TRY         10 422   8 760
INSURANCE   10 440   3 997

• The CFs are nearly equal, but INSURANCE is concentrated in
far fewer documents, so DF better reflects how discriminative
a term is
Basic Concepts: TFIDF
• Assign a TFIDF weight to each term 𝑖 in each
document 𝑑
w_{i,d} = TF_{i,d} × log(N / DF_i)
—TF_{i,d} = frequency of term i in document d
—DF_i = number of documents containing term i
—N = total number of documents in the collection
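A minimal sketch of this weighting over a hypothetical
three-document corpus (whitespace tokenization and the natural
logarithm are simplifying assumptions):

# Compute TF-IDF weights: w[d][term] = tf * log(N / df).
import math
from collections import Counter

docs = {
    "d1": "pizza pizza pasta",
    "d2": "pasta menu",
    "d3": "pizza menu menu",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)

weights = {
    d: {term: freq * math.log(N / df[term]) for term, freq in counts.items()}
    for d, counts in tf.items()
}
print(round(weights["d1"]["pizza"], 3))  # 2 * log(3/2) ≈ 0.811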
Documents as Vectors
• Each document 𝑑 can now be viewed as a vector of
TF × IDF values, one component for each term
• So we have a vector space
—terms are axes
—docs live in this space
—even with stemming, may have 100,000+
dimensions
• The corpus of documents gives us a matrix, which
we could also view as a vector space in which words
live
Why turn documents into vectors?
• First application: Query-by-example
—Given a doc/query 𝑑, find others “like” it (or most
similar to it)
• Now that 𝑑 is a vector, find vectors (docs) “near” it.
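One common way to make “near” precise is cosine similarity
between the vectors; a minimal sketch with hypothetical TF-IDF
vectors (term → weight):

# Rank documents by cosine similarity to a query vector.
import math

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

docs = {  # hypothetical TF-IDF vectors
    "d1": {"pizza": 0.81, "pasta": 0.41},
    "d2": {"pasta": 0.41, "menu": 0.41},
    "d3": {"pizza": 0.41, "menu": 0.81},
}
query = {"pizza": 1.0}

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # ['d1', 'd3', 'd2']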
What is not IR
• Why not a SQL query?
—IR ≠ DB query
—IR ≠ XML
• Data → Query
—IR = unstructured
—XML = semi-structured
—DB = structured
Search Engine Evolution
• 1st generation (1995–1997): uses only “on page” data
—Text data: word frequency, language
—Examples: AltaVista, Excite, Lycos, …
2nd Generation Search Engine
• Ranking — use off-page, web-specific data
—Link (or connectivity) analysis
—Click-through data (what results people click on)
—Anchor-text (how people refer to this page)
• Link Analysis
—Idea: mine hyperlink information on the Internet
—Assumptions:
◦ Links often connect related pages
◦ A link between pages is a recommendation
▪ “people vote with their links”
3rd Generation Search Engine
• Query language determination
—if the query is in Japanese, do not return English pages
• Different ranking
• Hard & soft matches
—Personalities (triggered on names)
—Cities (travel info, maps)
—Medical info (triggered on names and/or results)
—Stock quotes, news (triggered on stock symbol)
—Company info, …
• Integration of Search and Text Analysis
3rd Generation Search Engine (cont’d)
• Context determination
—where: spatial (user location/target location)
—when: query stream (previous queries)
—who: personalization (user profile)
—explicit (family friendly)
—implicit (use google.com.sg or google.fr)
• Context use
—Result restriction
—Ranking modulation
History: Google Architecture (circa 1998)
Implemented in C/C++ on Linux and off-the-shelf PCs
For Comparison, Google Today
Google Architecture (c. 1998): Crawler
Google Architecture (c. 1998): Indexer
Google Architecture (c. 1998): Searcher
Google’s Algorithm
• Imagine a browser randomly walking on the Internet:
—Start at a random page
—At each step, follow one of the page’s out-links, chosen
uniformly at random
[Figure: a page with three out-links; each is followed with
probability 1/3]
Google Teleporting
• At each visit, the surfer may also teleport: with some
probability it jumps to a page chosen uniformly at random from
the whole collection
[Figure: ten pages; each is the target of a teleport with
probability 1/10]
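Putting the two rules together gives PageRank. Below is a
minimal power-iteration sketch; the three-page link graph and
the damping factor of 0.85 are assumptions for illustration,
not Google’s actual data:

# PageRank by power iteration: repeatedly redistribute rank along
# out-links, mixing in a uniform "teleport" component.
damping = 0.85  # probability of following a link vs. teleporting

links = {  # hypothetical toy web: page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):
    new = {p: (1 - damping) / len(pages) for p in pages}
    for page, outs in links.items():
        for target in outs:
            new[target] += damping * rank[page] / len(outs)
    rank = new

print({p: round(rank[p], 3) for p in pages})
# {'a': 0.388, 'b': 0.215, 'c': 0.397}  -- ranks sum to ~1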
Anchor Text
• Associate anchor text of a link to the page it points
to
• Advantages:
—Links provide more accurate description
—Can index documents that text-based search engines
cannot (e.g., Google Image Search)
Key Google Optimization Techniques
• Each crawler maintains its local DNS lookup cache
• Parallelization of indexing phase
• In-memory lexicon
• Compression of repository
• Compact encoding of hitlists accounting for major space
savings
• Indexer is optimized so it is just faster than the crawler
(bottleneck)
• Document index updated in bulk
• Critical data structures placed on local disk
• Overall architecture designed to avoid disk seeks
wherever possible
Google Storage Requirements (c. 1998)
• Total disk space: approx. 106 GB
—a standard PC hard disk was approx. 10 GB in 1998

Price of 1 GB over time:
1981   $ 300 000
1987   $ 50 000
1990   $ 10 000
1994   $ 1 000
1997   $ 100
2000   $ 10
2004   $ 1
2012   $ 0.1
2017   $ 0.03