Module 1 Part B: Information Retrieval, Web Documents
Keyword queries
Boolean queries (using AND, OR, NOT)
Phrase queries
Proximity queries
Full document queries
Natural language questions
may be constructed
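Phrase and proximity queries differ from keyword queries in that word positions matter. The sketch below illustrates the difference on a tokenized document. The function names and the sample sentence are assumptions made for illustration, not part of the course material.

```python
def positions(tokens, term):
    """Return the positions at which term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def phrase_match(tokens, phrase):
    """True if the words of phrase occur consecutively and in order."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def proximity_match(tokens, a, b, k):
    """True if terms a and b occur within k positions of each other."""
    return any(abs(i - j) <= k
               for i in positions(tokens, a)
               for j in positions(tokens, b))


doc = "the web search engine builds an index of the web".split()
print(phrase_match(doc, ["search", "engine"]))     # True
print(phrase_match(doc, ["engine", "search"]))     # False
print(proximity_match(doc, "search", "index", 4))  # True
print(proximity_match(doc, "search", "index", 2))  # False
```

A phrase query requires adjacency and order, while a proximity query only bounds the distance between the terms.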
Why do we need to remove stopwords?
Reduce indexing (or data) file size
Given a query:
Are all retrieved documents relevant?
Have all the relevant documents been retrieved?
Measures for system performance:
The first question is about the precision of the search.
The second is about the completeness (recall) of the search.
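As a minimal sketch of the two measures: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name and the document ids below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d2", "d3", "d7", "d8", "d9"])
print(p, r)  # 0.75 0.5
```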
Architecture of SE
Search Engine
[Figure: sample results page showing paid search ads (sponsored links) alongside algorithmic results.]
Search Engine Architecture
[Figure: architecture diagram with the components User, Search, Indexer, Web spider, and The Web.]
Indexing Process
Text acquisition
identifies and stores documents for indexing
Text transformation
transforms documents into index terms or features
Index creation
takes index terms and creates data structures
(indexes) to support fast searching
Inverted index
The inverted index of a document collection
is a data structure that
associates each distinct term with a list of all
documents that contain the term.
Thus, in retrieval, finding the documents that contain a query term
takes a single dictionary lookup (expected constant time with a hash-based dictionary).
Multiple query terms are also easy to handle, as we
will see soon.
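A minimal sketch of an inverted index, assuming whitespace tokenization and a Python dict as the term dictionary. The three sample documents and the function names are invented for illustration. A query with several terms is answered by intersecting the document lists of its terms.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each distinct term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def and_query(index, terms):
    """Return ids of documents that contain every query term."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: "web search engine",
    2: "web crawler fetches pages",
    3: "search engine index",
}
index = build_inverted_index(docs)
print(index["web"])                            # [1, 2]
print(and_query(index, ["search", "engine"]))  # [1, 3]
print(and_query(index, ["web", "index"]))      # []
```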
Query Process
User interaction
supports creation and refinement of query, display of
results
Ranking
uses query and indexes to generate ranked list of
documents
Evaluation
monitors and measures effectiveness and efficiency
(primarily offline)
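The slides do not say which ranking function is used. As one common possibility, the sketch below scores documents with a simple TF-IDF sum over the query terms. The scoring formula, the function name, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank(query, docs):
    """Rank documents by a simple TF-IDF score; return (doc_id, score) pairs."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = sum(tf[t] * math.log(n_docs / df[t])
                    for t in query.lower().split() if t in tf)
        if score > 0:
            scores[doc_id] = score
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    "d1": "miele appliances and miele kitchens",
    "d2": "discount appliances",
    "d3": "kitchens of the world",
}
ranked = rank("miele appliances", docs)
print([doc_id for doc_id, _ in ranked])  # ['d1', 'd2']
```

Document d1 ranks first because it contains the rarer term "miele" twice; d3 matches no query term and is left out of the list.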
Indexing Process
Web Crawler
Starts with a set of seeds, which are a set of URLs given to it
as parameters
Seeds are added to a URL request queue
Crawler starts fetching pages from the request queue
Downloaded pages are parsed to find link tags that might
contain other useful URLs to fetch
New URLs are added to the crawler’s request queue, or frontier
Continues until there are no more new URLs or the disk is full
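The crawler steps above can be sketched as follows. To keep the example self-contained, pages are fetched from a small in-memory dictionary rather than over HTTP. The `fetch` callback, the `max_pages` limit (standing in for "disk full"), and the example URLs are assumptions.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect href values of <a> tags found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    frontier = deque(seeds)  # URL request queue
    seen = set(seeds)        # URLs already queued, to avoid refetching
    store = {}               # downloaded pages, keyed by URL
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:     # page could not be fetched
            continue
        store[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store


# Toy in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a> '
                         '<a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://a.example/">A</a>',
    "http://c.example/": '<a href="http://d.example/">D (missing)</a>',
}
pages = crawl(["http://a.example/"], FAKE_WEB.get)
print(sorted(pages))
# ['http://a.example/', 'http://b.example/', 'http://c.example/']
```

A real crawler would also resolve relative URLs and respect politeness rules such as robots.txt; both are omitted here.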
[Figure: crawling picture, showing the URLs crawled and parsed and the unseen Web.]
Text Acquisition
Feeds
Real-time streams of documents
e.g., web feeds for news, blogs, video, radio, TV
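The slides do not name a feed format. Assuming RSS 2.0, one common choice, a feed reader can be sketched with the standard library XML parser. The feed text below is hand-written for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 document standing in for a real feed.
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title><link>http://news.example/1</link></item>
    <item><title>Second story</title><link>http://news.example/2</link></item>
  </channel>
</rss>"""


def read_feed(xml_text):
    """Return (title, link) pairs for each item in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


for title, link in read_feed(RSS):
    print(title, link)
```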
Document data store
Stores text, metadata, and other related content for
documents
Metadata is information about a document, such as its type and
creation date
Other content includes links, anchor text
Provides fast access to document contents for search
engine components
e.g. result list generation
Could use relational database system
More typically, a simpler, more efficient storage system is
used due to huge numbers of documents
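A minimal sketch of such a simpler storage system: an in-memory key-value store holding text, metadata, and links per document id. Compressing the records and the record layout are assumptions, and a production store would live on disk. All names and sample values are hypothetical.

```python
import json
import zlib


class DocumentStore:
    """Toy in-memory document store: compressed records keyed by document id."""

    def __init__(self):
        self._blobs = {}

    def put(self, doc_id, text, metadata=None, links=None):
        """Store text, metadata, and links as one compressed JSON record."""
        record = {"text": text, "metadata": metadata or {}, "links": links or []}
        self._blobs[doc_id] = zlib.compress(json.dumps(record).encode("utf-8"))

    def get(self, doc_id):
        """Return the stored record as a dict; raises KeyError if absent."""
        return json.loads(zlib.decompress(self._blobs[doc_id]).decode("utf-8"))


store = DocumentStore()
store.put("http://a.example/",
          "Welcome to the example page.",
          metadata={"type": "text/html", "created": "2024-01-01"},
          links=[{"url": "http://b.example/", "anchor": "B"}])
record = store.get("http://a.example/")
print(record["metadata"]["type"])  # text/html
print(record["links"][0]["anchor"])  # B
```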
Text Transformation
Parser
Processing the sequence of text tokens in the document to
recognize structural elements
e.g., titles, links, headings, etc.
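As an illustration of recognizing structural elements, the sketch below uses Python's standard `html.parser` to record titles, headings, and link targets. The class name, the choice of tags, and the sample page are assumptions.

```python
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Record the text of titles and headings, and the targets of links."""

    TEXT_TAGS = {"title", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.elements = []  # (kind, value) pairs in document order
        self._open = None   # structural tag currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.TEXT_TAGS:
            self._open = tag
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.elements.append(("link", href))

    def handle_data(self, data):
        # Keep text only while inside a title or heading.
        if self._open and data.strip():
            self.elements.append((self._open, data.strip()))

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None


page = ("<html><head><title>Miele</title></head><body>"
        "<h1>Appliances</h1><p>See <a href='/kitchens'>kitchens</a>.</p>"
        "</body></html>")
parser = StructureParser()
parser.feed(page)
print(parser.elements)
# [('title', 'Miele'), ('h1', 'Appliances'), ('link', '/kitchens')]
```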
Stopping
Remove common words
e.g., “and”, “or”, “the”, “in”
Stemming
Group words derived from a common stem
e.g., “computer”, “computers”, “computing”, “compute”
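Stopping and stemming can be sketched together. The stemmer below is a crude suffix stripper written for this example, not the Porter stemmer or any other published algorithm, and the stopword list is a small illustrative sample.

```python
STOPWORDS = {"and", "or", "the", "in", "a", "of", "to", "is"}  # illustrative list
SUFFIXES = ("ing", "ers", "er", "es", "e", "s")  # longer suffixes checked first


def stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters of stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def index_terms(text):
    """Lowercase, drop stopwords, and stem the remaining words."""
    words = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    return [stem(w) for w in words if w and w not in STOPWORDS]


print([stem(w) for w in ["computer", "computers", "computing", "compute"]])
# ['comput', 'comput', 'comput', 'comput']
print(index_terms("The computers in the lab"))
# ['comput', 'lab']
```

All four example words map to the same index term, so a query for any one of them matches documents containing the others.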
Link Analysis
Makes use of links and anchor text in web pages
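One simple way to use links and anchor text, sketched under assumptions: count the incoming links of each page as a rough popularity signal, and collect the anchor text of those links as extra descriptive text for the target page. The link records and names below are hypothetical; algorithms such as PageRank go further than inlink counting.

```python
from collections import defaultdict

# Hypothetical link records: (source page, target page, anchor text).
LINKS = [
    ("a.example", "miele.example", "Miele appliances"),
    ("b.example", "miele.example", "luxury kitchens"),
    ("a.example", "b.example", "appliance reviews"),
]


def analyse_links(links):
    """Return (inlink counts, anchor texts) per target page."""
    inlinks = defaultdict(int)
    anchors = defaultdict(list)
    for source, target, anchor_text in links:
        inlinks[target] += 1
        anchors[target].append(anchor_text)
    return dict(inlinks), dict(anchors)


inlinks, anchors = analyse_links(LINKS)
print(inlinks)                   # {'miele.example': 2, 'b.example': 1}
print(anchors["miele.example"])  # ['Miele appliances', 'luxury kitchens']
```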
Information Extraction
Identify classes of index terms that are important for
some applications
e.g., named entity recognizers identify classes such as
people, locations, companies, dates, etc.
Classifier
Identifies class-related metadata for documents
i.e., assigns labels to documents
e.g., topics, reading levels, sentiment
Use depends on application
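The two components above can be illustrated with toy versions: a regular expression that extracts dates in one fixed format, and a keyword-counting topic classifier. Real extractors and classifiers are usually trained models; the pattern, keyword lists, and sample sentence here are assumptions.

```python
import re

# Matches dates written as YYYY-MM-DD only; real extractors handle more forms.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Hypothetical keyword lists for two topics.
TOPIC_KEYWORDS = {
    "sports": {"match", "team", "goal"},
    "technology": {"software", "search", "index"},
}


def extract_dates(text):
    """Return all date strings found in the text."""
    return DATE_RE.findall(text)


def classify_topic(text):
    """Assign the topic whose keywords occur most often, or None if none occur."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = {topic: sum(w in kws for w in words)
              for topic, kws in TOPIC_KEYWORDS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


text = "On 2024-03-15 the team released new search software."
print(extract_dates(text))   # ['2024-03-15']
print(classify_topic(text))  # technology
```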
Summary