
Module 1: Information Retrieval and Web Search
An introduction

Introduction
 Text mining refers to data mining using text
documents as data.
 Most text mining tasks use Information
Retrieval (IR) methods to pre-process text
documents.
 These methods are quite different from
traditional data pre-processing methods
used for relational tables.
 Web search also has its roots in IR.

CS583, Bing Liu, UIC 2


Information Retrieval (IR)
 Conceptually, IR is the study of finding needed
information. I.e., IR helps users find information
that matches their information needs.
 Expressed as queries
 Historically, IR is about document retrieval,
emphasizing the document as the basic unit.
 Finding documents relevant to user queries
 Technically, IR studies the acquisition,
organization, storage, retrieval, and distribution of
information.

CS583, Bing Liu, UIC 3


IR architecture

CS583, Bing Liu, UIC 4


IR queries

 Keyword queries
 Boolean queries (using AND, OR, NOT)
 Phrase queries
 Proximity queries
 Full document queries
 Natural language questions

CS583, Bing Liu, UIC 5


Information retrieval models
 An IR model governs how a document and a
query are represented and how the relevance
of a document to a user query is defined.
 Main models:
 Boolean model
 Vector space model
 Statistical language model
 etc.

CS583, Bing Liu, UIC 6


Boolean model
 Each document or query is treated as a
“bag” of words or terms. Word sequence is
not considered.
 Given a collection of documents D, let V = {t1,
t2, ..., t|V|} be the set of distinctive words/terms
in the collection. V is called the vocabulary.
 A weight wij > 0 is associated with each term
ti of a document dj ∈ D. For a term that does
not appear in document dj, wij = 0.
dj = (w1j, w2j, ..., w|V|j),

CS583, Bing Liu, UIC 7


Boolean model (contd)
 Query terms are combined logically using the
Boolean operators AND, OR, and NOT.
 E.g., ((data AND mining) AND (NOT text))
 Retrieval
 Given a Boolean query, the system retrieves
every document that makes the query logically
true.
 Called exact match.
 The retrieval results are usually quite poor
because term frequency is not considered.
CS583, Bing Liu, UIC 8
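
To make the exact-match idea concrete, here is a minimal sketch (not part of the original slides) of Boolean retrieval, with each document reduced to the set of terms it contains; the three toy documents are invented for illustration.

```python
# Boolean retrieval sketch: a document is its set of terms, and a Boolean
# query is evaluated with ordinary logical operations (exact match).

docs = {
    "d1": {"data", "mining", "text"},
    "d2": {"data", "mining"},
    "d3": {"web", "search", "text"},
}

def matches(terms):
    # The query from the slide: ((data AND mining) AND (NOT text))
    return ("data" in terms) and ("mining" in terms) and ("text" not in terms)

retrieved = [name for name, terms in docs.items() if matches(terms)]
print(retrieved)  # ['d2'] -- the only document that makes the query true
```
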
Vector space model
 Documents are also treated as a “bag” of words or
terms.
 Each document is represented as a vector.
 However, the term weights are no longer 0 or 1.
Each term weight is computed based on some
variation of the TF or TF-IDF scheme.
 Term Frequency (TF) Scheme: The weight of a term
ti in document dj is the number of times that ti
appears in dj, denoted by fij. Normalization may also
be applied.
CS583, Bing Liu, UIC 9
TF-IDF term weighting scheme
 The most well-known weighting scheme
 TF: still term frequency. The normalized TF of term ti
in document dj is
tfij = fij / max{f1j, f2j, ..., f|V|j}
 IDF: inverse document frequency.
N: total number of docs
dfi: the number of docs in which ti appears
idfi = log(N / dfi)
 The final TF-IDF term weight is:
wij = tfij × idfi
CS583, Bing Liu, UIC 10
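As an illustration of the scheme above, the following sketch computes TF-IDF weights with normalized TF (fij divided by the maximum frequency in dj) and idfi = log(N / dfi); the three toy documents are made up for this example.

```python
import math
from collections import Counter

# Toy collection, invented for illustration
docs = {
    "d1": ["data", "mining", "data", "text"],
    "d2": ["web", "search", "data"],
    "d3": ["text", "mining", "text"],
}
N = len(docs)

# df_i: the number of documents in which term t_i appears
df = Counter()
for words in docs.values():
    df.update(set(words))

def tfidf(words):
    """w_ij = (f_ij / max_k f_kj) * log(N / df_i) for one document."""
    counts = Counter(words)
    max_f = max(counts.values())
    return {t: (f / max_f) * math.log(N / df[t]) for t, f in counts.items()}

for name, words in docs.items():
    print(name, tfidf(words))
```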


Retrieval in vector space model
 Query q is represented in the same way or slightly
differently.
 Relevance of di to q: Compare the similarity of
query q and document di.
 Cosine similarity (the cosine of the angle between
the two vectors)

 Cosine is also commonly used in text clustering

CS583, Bing Liu, UIC 11


An Example
 A document space is defined by three terms:
 hardware, software, users
 the vocabulary
 A set of documents is defined as:
 A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
 A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
 A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)

 If the query is “hardware and software”,
 what documents should be retrieved?

CS583, Bing Liu, UIC 12


An Example (cont.)
 In Boolean query matching:
 documents A4 and A7 will be retrieved (“AND”)
 retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”)
 In similarity matching (cosine):
 q=(1, 1, 0)
 S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
 S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
 S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
 Retrieved document set (with ranking) =
 {A4, A7, A1, A2, A5, A6, A8, A9}

CS583, Bing Liu, UIC 13
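The cosine scores in this example can be reproduced with a few lines of code; the vectors and the query are exactly those above, and only the small helper function is added.

```python
import math

# Term order: (hardware, software, users) -- the three-term vocabulary above
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # query "hardware and software"

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

ranked = sorted(docs, key=lambda d: cosine(q, docs[d]), reverse=True)
for d in ranked:
    print(d, round(cosine(q, docs[d]), 2))  # A4=1.0, A7=0.82, A1=A2=0.71, ...
```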


Okapi relevance method
 Another way to assess the degree of relevance is to
directly compute a relevance score for each
document to the query.
 The Okapi (BM25) method and its variations are popular
techniques in this setting.

CS583, Bing Liu, UIC 14


Relevance feedback
 Relevance feedback is one of the techniques for
improving retrieval effectiveness. The steps:
 the user first identifies some relevant (Dr) and irrelevant
documents (Dir) in the initial list of retrieved documents
 the system expands the query q by extracting some
additional terms from the sample relevant and irrelevant
documents to produce qe
 perform a second round of retrieval.
 Rocchio method (α, β and γ are parameters):
qe = α q + (β/|Dr|) Σ dr∈Dr dr − (γ/|Dir|) Σ dir∈Dir dir
CS583, Bing Liu, UIC 15
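A minimal sketch of the Rocchio update above, with vectors kept as plain Python lists; the parameter values α=1, β=0.75, γ=0.15 and the toy vectors are illustrative choices, not values prescribed by the slides.

```python
# Rocchio relevance feedback (sketch):
#   qe = alpha*q + (beta/|Dr|) * sum(dr in Dr) - (gamma/|Dir|) * sum(dir in Dir)

def rocchio(q, Dr, Dir, alpha=1.0, beta=0.75, gamma=0.15):
    dims = range(len(q))
    qe = [alpha * q[i] for i in dims]
    for d in Dr:                      # pull the query toward relevant documents
        for i in dims:
            qe[i] += beta / len(Dr) * d[i]
    for d in Dir:                     # push it away from irrelevant documents
        for i in dims:
            qe[i] -= gamma / len(Dir) * d[i]
    return [max(w, 0.0) for w in qe]  # negative weights are commonly clipped to 0

q = [1.0, 1.0, 0.0]                   # toy query vector
Dr = [[0.9, 0.8, 0.1]]                # documents judged relevant
Dir = [[0.1, 0.0, 0.9]]               # documents judged irrelevant
print(rocchio(q, Dr, Dir))            # the expanded query qe
```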


Rocchio text classifier
 In fact, a variation of the Rocchio method above,
called the Rocchio classification method, can be
used to improve retrieval effectiveness too
 The Rocchio classifier is constructed by producing a
prototype vector ci for each class i (relevant or
irrelevant in this case):
ci = (α/|Di|) Σ d∈Di d/||d|| − (β/|D−Di|) Σ d∈D−Di d/||d||
 In classification, cosine similarity is used.

CS583, Bing Liu, UIC 16


Text pre-processing

 Word (term) extraction: easy


 Stopwords removal
 Stemming
 Frequency counts and computing TF-IDF
term weights.

CS583, Bing Liu, UIC 17


Stopwords removal
 Many of the most frequently used words in English are useless
in IR and text mining – these words are called stopwords.
 the, of, and, to, ….
 Typically about 400 to 500 such words
 For an application, an additional domain-specific stopword list
may be constructed
 Why do we need to remove stopwords?
 Reduce indexing (or data) file size
 stopwords account for 20-30% of total word counts.
 Improve efficiency and effectiveness
 stopwords are not useful for searching or text mining
 they may also confuse the retrieval system.

CS583, Bing Liu, UIC 18


Stemming
 Techniques used to find out the root/stem of a
word. E.g.,
 user, users, used, using → stem: use
 engineering, engineered, engineer → stem: engineer
Usefulness:
 improving effectiveness of IR and text mining
 matching similar words
 mainly improves recall
 reducing indexing size
 combining words with the same roots may reduce indexing
size by as much as 40-50%.

CS583, Bing Liu, UIC 19


Basic stemming methods
Using a set of rules. E.g.,
 remove ending
 if a word ends with a consonant other than s,
followed by an s, then delete s.
 if a word ends in es, drop the s.
 if a word ends in ing, delete the ing unless the remaining word
consists only of one letter or of th.
 If a word ends with ed, preceded by a consonant, delete the ed
unless this leaves only a single letter.
 …...
 transform words
 if a word ends with “ies” but not “eies” or “aies” then “ies --> y.”

CS583, Bing Liu, UIC 20
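The rules listed above can be coded directly; the sketch below implements only those rules (it is deliberately crude and nowhere near a full stemmer such as Porter's).

```python
VOWELS = set("aeiou")

def simple_stem(word):
    """Toy stemmer implementing only the rules listed on the slide."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"                                   # flies -> fly
    if w.endswith("es"):
        return w[:-1]                                         # drop the s
    if w.endswith("s") and len(w) > 1 and w[-2] not in VOWELS and w[-2] != "s":
        return w[:-1]                                         # cats -> cat
    if w.endswith("ing") and len(w) > 4 and w[:-3] != "th":
        return w[:-3]                                         # engineering -> engineer
    if w.endswith("ed") and len(w) > 3 and w[-3] not in VOWELS:
        return w[:-2]                                         # engineered -> engineer
    return w

print([simple_stem(w) for w in ["flies", "cats", "engineering", "engineered", "thing"]])
# ['fly', 'cat', 'engineer', 'engineer', 'thing']
```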


Frequency counts + TF-IDF
 TF: counts the number of times a word
occurs in a document.
 Using occurrence frequencies to indicate the relative
importance of a word in a document.
 if a word appears often in a document, the document
likely “deals with” subjects related to the word.
 IDF: counts the number of documents in the
collection that contain each word.
 TF-IDF can then be computed.

CS583, Bing Liu, UIC 21


Evaluation: Precision and Recall

 Given a query:
 Are all retrieved documents relevant?
 Have all the relevant documents been retrieved?
 Measures for system performance:
 The first question is about the precision of the
search
 The second is about the completeness (recall) of
the search.

CS583, Bing Liu, UIC 22
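As a concrete illustration (not from the original slides), precision and recall can be computed from the set of retrieved documents and the set of relevant documents; the document IDs below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """precision = |retrieved ∩ relevant| / |retrieved|
       recall    = |retrieved ∩ relevant| / |relevant|"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = ["d1", "d2", "d4", "d7", "d11", "d12", "d13", "d14"]
print(precision_recall(retrieved, relevant))
# (0.4, 0.5): 4 of the 10 retrieved are relevant; 4 of the 8 relevant are retrieved
```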


Precision-recall curve

CS583, Bing Liu, UIC 23


Example
Consider a document collection D with 20 documents. Given a query q, we
know that eight documents are relevant to q. A retrieval algorithm produces
the ranking (of all documents in D).

CS583, Bing Liu, UIC 24


Rank precision

 Compute the precision values at some
selected rank positions.
 Mainly used in Web search evaluation.
 For a Web search engine, we can compute
precisions for the top 5, 10, 15, 20, 25 and 30
returned pages
 as the user seldom looks at more than 30 pages.
 Recall is not very meaningful in Web search.
 Why?
CS583, Bing Liu, UIC 25
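
A sketch of rank precision: given a ranked result list and the set of relevant documents, compute precision at selected cut-off positions; the ranking and relevance judgments here are invented.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

ranking = ["d3", "d7", "d1", "d9", "d2", "d5", "d8", "d4", "d6", "d10"]  # hypothetical
relevant = {"d1", "d2", "d3", "d7"}

for k in (5, 10):
    print(f"P@{k} = {precision_at_k(ranking, relevant, k):.2f}")
# P@5 = 0.80, P@10 = 0.40
```
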
Web Search as a huge IR system

 A Web crawler (robot) crawls the Web to
collect all the pages.
 Servers establish a huge inverted index
database and other indexing databases.
 At query (search) time, search engines
conduct different types of vector query
matching.

CS583, Bing Liu, UIC 26


Different search engines
 The real differences among different search
engines are
 their index weighting schemes
 Including location of terms, e.g., title, body,
emphasized words, etc.
 their query processing methods (e.g., query
classification, expansion, etc)
 their ranking algorithms
 Few of these are published by any of the search
engine companies. They are tightly guarded
secrets.

CS583, Bing Liu, UIC 27


 Search Engine

Architecture of a Search Engine

How do search engines like Google work?

28
 Search Engine

(Figure: a search results page, with paid search ads shown alongside the
algorithmic results.)
29
 Search Engine

Architecture
(Figure: overall search engine architecture. A web spider crawls the Web, an
indexer builds the indexes and ad indexes, and the search component answers
user queries; illustrated with a results page for the query “miele”, showing
sponsored links next to the algorithmic web results.)
30
 Search Engine

Indexing Process

31
 Search Engine

Indexing Process
 Text acquisition
 identifies and stores documents for indexing

 Text transformation
 transforms documents into index terms or features

 Index creation
 takes index terms and creates data structures
(indexes) to support fast searching

32
Inverted index
 The inverted index of a document collection
is basically a data structure that
 attaches each distinct term to a list of all the
documents that contain the term.
 Thus, in retrieval, it takes constant time to
 find the documents that contain a query term.
 multiple query terms are also easy to handle, as we
will see soon.

CS583, Bing Liu, UIC 33
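A minimal sketch of such a structure: a dictionary that maps each term to the sorted list of IDs of the documents containing it; the three short documents are made up.

```python
from collections import defaultdict

docs = {
    1: "web mining is useful",
    2: "usage mining applications",
    3: "web structure mining studies the web hyperlink structure",
}

# term -> list of IDs of documents containing the term
inverted_index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].lower().split()):
        inverted_index[term].append(doc_id)

print(inverted_index["mining"])  # [1, 2, 3]
print(inverted_index["web"])     # [1, 3]
```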


An example

CS583, Bing Liu, UIC 34


Index construction- Example

CS583, Bing Liu, UIC 35


Search using inverted index

Given a query q, search has the following steps:
 Step 1 (vocabulary search): find each
term/word in q in the inverted index.
 Step 2 (results merging): merge the results to
find documents that contain all or some of the
words/terms in q.
 Step 3 (rank score computation): rank
the resulting documents/pages using
 content-based ranking
 link-based ranking

CS583, Bing Liu, UIC 36
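The three steps can be sketched as follows for a conjunctive ("all terms") query; the tiny index corresponds to the toy documents of the previous sketch, and a real engine would also store positions and precomputed weights in the postings.

```python
# Toy inverted index (term -> doc IDs), matching the previous sketch
inverted_index = {
    "web": [1, 3], "mining": [1, 2, 3], "is": [1], "useful": [1],
    "usage": [2], "applications": [2], "structure": [3],
    "studies": [3], "the": [3], "hyperlink": [3],
}

def search(query, index):
    # Step 1 (vocabulary search): look up the posting list of each query term
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    # Step 2 (results merging): here, keep documents containing ALL query terms
    merged = set.intersection(*postings)
    # Step 3 (rank score computation): content-based and/or link-based scores
    # would order these documents; this sketch returns them unranked
    return sorted(merged)

print(search("web mining", inverted_index))  # [1, 3]
```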


 Search Engine

Query Process

37
 Search Engine

Query Process
 User interaction
 supports creation and refinement of query, display of
results

 Ranking
 uses query and indexes to generate ranked list of
documents
 Evaluation
 monitors and measures effectiveness and efficiency
(primarily offline)

38
Indexing Process

Details: Text Acquisition


 Crawler
 Identifies and acquires documents for search engine
 Many types – web, enterprise, desktop
 Web crawlers follow links to find documents
 Must efficiently find huge numbers of web pages (coverage)
and keep them up-to-date (freshness)
 Single site crawlers for site search
 Topical or focused crawlers for vertical search
 Document crawlers for enterprise and desktop search
 Follow links and scan directories

39
Indexing Process

Web Crawler
 Starts with a set of seeds, which are a set of URLs given to it
as parameters
 Seeds are added to a URL request queue
 Crawler starts fetching pages from the request queue
 Downloaded pages are parsed to find link tags that might
contain other useful URLs to fetch
 New URLs added to the crawler’s request queue, or frontier
 Continue until no more new URLs or disk full

40
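
The crawling loop just described can be sketched with the Python standard library alone; this toy version ignores robots.txt, politeness delays, duplicate detection and freshness, and the commented-out seed URL is only a placeholder.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href values of <a> tags while parsing a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=10):
    frontier = deque(seeds)            # the URL request queue ("frontier")
    seen = set(seeds)
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                   # skip pages that cannot be fetched
        fetched.append(url)
        parser = LinkParser()
        parser.feed(html)              # parse the page to find link tags
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)     # new URLs go onto the frontier
                frontier.append(absolute)
    return fetched

# print(crawl(["https://example.com/"]))  # hypothetical seed URL
```
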
Indexing Process

Crawling picture

(Figure: seed URLs are placed on the URL frontier; crawled URLs are fetched
and parsed into pages, while the rest of the Web remains unseen.)

41
Indexing Process

Crawling the Web

42
Indexing Process

Text Acquisition
 Feeds
 Real-time streams of documents
 e.g., web feeds for news, blogs, video, radio, tv
 RSS is a common standard
 An RSS “reader” can provide new XML documents to the search
engine
 Conversion
 Convert a variety of documents into a consistent text plus
metadata format
 e.g. HTML, XML, Word, PDF, etc. → XML
 Convert text encoding for different languages
 Using a Unicode encoding such as UTF-8

43
Indexing Process

Text Acquisition
 Document data store
 Stores text, metadata, and other related content for
documents
 Metadata is information about a document, such as its type and
creation date
 Other content includes links, anchor text
 Provides fast access to document contents for search
engine components
 e.g. result list generation
 Could use relational database system
 More typically, a simpler, more efficient storage system is
used due to huge numbers of documents
44
Indexing Process

Text Transformation
 Parser
 Processing the sequence of text tokens in the document to
recognize structural elements
 e.g., titles, links, headings, etc.

 Tokenizer recognizes “words” in the text


 must consider issues like capitalization, hyphens,
apostrophes, non-alpha characters, separators
 Markup languages such as HTML and XML are often used to specify
structure
 Tags used to specify document elements
 E.g., <h2> Overview </h2>
 Document parser uses syntax of markup language (or other
formatting) to identify structure

45
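
A tiny illustration of the tokenization issues listed above, using one possible policy (lower-case everything and split on any non-alphanumeric character); real tokenizers make more careful choices about hyphens and apostrophes.

```python
import re

def tokenize(text):
    """Lower-case the text and split on runs of non-alphanumeric characters,
    so hyphens, apostrophes and other separators all break tokens."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

print(tokenize("Don't under-estimate state-of-the-art IR systems!"))
# ['don', 't', 'under', 'estimate', 'state', 'of', 'the', 'art', 'ir', 'systems']
```
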
Indexing Process

Text Transformation
 Stopping
 Remove common words
 e.g., “and”, “or”, “the”, “in”

 Some impact on efficiency and effectiveness


 Can be a problem for some queries

 Stemming
 Group words derived from a common stem
 e.g., “computer”, “computers”, “computing”, “compute”

 Usually effective, but not for all queries


 Benefits vary for different languages

46
Indexing Process

Text Transformation
 Link Analysis
 Makes use of links and anchor text in web pages

 Link analysis identifies popularity and community
information
 e.g., PageRank
 Anchor text can significantly enhance the
representation of pages pointed to by links
 Significant impact on web search

47
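
As one concrete instance of link analysis, here is a minimal PageRank power-iteration sketch over a made-up three-page link graph; the damping factor 0.85 is the commonly cited value, and everything else is illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict: page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if outlinks:
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new_rank[q] += share
            else:                       # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Tiny made-up graph: A links to B, B links to C, C links back to B
print(pagerank({"A": ["B"], "B": ["C"], "C": ["B"]}))
```
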
Indexing Process

Text Transformation
 Information Extraction
 Identify classes of index terms that are important for
some applications
 e.g., named entity recognizers identify classes such as
people, locations, companies, dates, etc.
 Classifier
 Identifies class-related metadata for documents
 i.e., assigns labels to documents
 e.g., topics, reading levels, sentiment
 Use depends on application

48
Summary

 Web Document Representation


 Boolean and Vector Space models
 Preprocessing
 Web search engine
 Architecture
 Overview of crawling

CS583, Bing Liu, UIC 49
