Topic 3 W3 Crawls and Feeds - SDR - March2023

Search Engines

Information Retrieval in Practice

All slides ©Addison Wesley, 2008


Web Crawler
• Finds and downloads web pages automatically
– provides the collection for searching
• Web is huge and constantly growing
• Web is not under the control of search engine
providers
• Web pages are constantly changing
• Crawlers also used for other types of data
Retrieving Web Pages
• Every page has a unique uniform resource
locator (URL)
• Web pages are stored on web servers that use
HTTP to exchange information with client
software
• e.g., http://www.example.com/index.html identifies a
page stored on the web server www.example.com
Retrieving Web Pages
• Web crawler connects to a domain name
system (DNS) server
• DNS server translates the hostname into an
internet protocol (IP) address
• Crawler then attempts to connect to server
host using specific port
• After connection, crawler sends an HTTP
request to the web server to request a page
– usually a GET request
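A minimal sketch of this retrieval step in Python using only the standard library; the hostname and URL are placeholders, not taken from the slides:

import socket
from urllib.request import urlopen

# DNS lookup: translate the hostname into an IP address
ip_address = socket.gethostbyname("www.example.com")

# HTTP GET request: urlopen connects to the server (port 80 for
# http) and asks it to return the page
with urlopen("http://www.example.com/") as response:
    page = response.read()

print(ip_address, len(page))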
Crawling the Web
Web Crawler
• Starts with a set of seeds, which are a set of URLs
given to it as parameters
• Seeds are added to a URL request queue
• Crawler starts fetching pages from the request
queue
• Downloaded pages are parsed (processed) to
identify link tags that might contain other
useful URLs to fetch
• New URLs are added to the crawler's request queue
• Crawling continues until no new URLs are found
or the disk is full
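A simplified sketch of this crawl loop, assuming a single thread and a regular-expression link extractor (a real crawler would also apply politeness rules, robots.txt checks, and URL normalization):

import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seeds, max_pages=100):
    queue = deque(seeds)       # URL request queue seeded with the start URLs
    seen = set(seeds)
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except OSError:
            continue           # skip pages that fail to download
        # Parse the page for link tags and queue any new URLs
        for href in re.findall(r'href="([^"]+)"', html):
            new_url = urljoin(url, href)
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)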
Web Crawling
• Web crawlers spend a lot of time waiting for
responses to requests
• To reduce this inefficiency, web crawlers use
threads and fetch hundreds of pages at once
• Crawlers could potentially flood sites with
requests for pages
• To avoid this problem, web crawlers use
politeness policies
– e.g., delay between requests to same web server
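A sketch of a per-host politeness delay; the 10-second value is an assumed setting, not one given in the slides:

import time
from urllib.parse import urlparse

POLITENESS_DELAY = 10.0   # assumed minimum seconds between requests to one host
last_request = {}         # host -> time of the previous request

def wait_for_politeness(url):
    host = urlparse(url).netloc
    elapsed = time.time() - last_request.get(host, 0.0)
    if elapsed < POLITENESS_DELAY:
        time.sleep(POLITENESS_DELAY - elapsed)   # wait out the remaining delay
    last_request[host] = time.time()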
Controlling Crawling
• Even crawling a site slowly will anger some
web server administrators, who object to any
copying of their data
• A robots.txt file can be used to control crawlers
– e.g., it can disallow certain paths and list the
site's sitemap
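Python's standard library can read and apply robots.txt rules; a sketch (the site and user agent are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.example.com/robots.txt")
rp.read()

# Ask whether our crawler may fetch a particular page
if rp.can_fetch("MyCrawler", "http://www.example.com/private/data.html"):
    print("allowed")
else:
    print("disallowed by robots.txt")

# Sitemap URLs listed in robots.txt, if any (Python 3.8+)
print(rp.site_maps())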
Freshness
• Web pages are constantly being added,
deleted, and modified
• Web crawler must continually revisit pages it
has already crawled to see if they have
changed in order to maintain the freshness of
the document collection
– stale copies no longer reflect the real contents of
the web pages
Freshness
• HTTP protocol has a special request type
called HEAD that makes it easy to check for
page changes
– returns information about page, not page itself
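A sketch of this check with a HEAD request (the URL is a placeholder); the response headers describe the page without transferring its body:

from urllib.request import Request, urlopen

req = Request("http://www.example.com/index.html", method="HEAD")
with urlopen(req) as response:
    print(response.headers.get("Last-Modified"))   # when the page last changed
    print(response.headers.get("Content-Length"))  # size of the page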
Freshness
• Not possible to constantly check all pages
– must check important pages and pages that
change frequently (e.g., news sites, government sites)
• Freshness is the proportion of crawled pages
whose stored copies are still fresh (up to date)
Focused Crawling
• Attempts to download only those pages that are
about a particular topic
– used by vertical search applications (e.g., Mocavo.com
– discover family history; Yelp.com – find local
businesses in SF; Trulia.com – property search)
• Rely on the fact that pages about a topic tend to
have links to other pages on the same topic
– popular pages for a topic are typically used as seeds
(e.g., WebMD for health topics)
• Crawler uses text classifier to decide whether a
page is on topic (or just a combination of many
topics)
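A sketch of how the classifier fits into the crawl loop; is_on_topic is a stand-in for a trained text classifier, and the keyword test and topic words are purely illustrative:

TOPIC_WORDS = {"genealogy", "ancestry", "family", "records"}   # assumed topic vocabulary

def is_on_topic(page_text, threshold=2):
    # Stand-in for a real classifier: count topic words on the page
    words = set(page_text.lower().split())
    return len(words & TOPIC_WORDS) >= threshold

def enqueue_links_if_relevant(page_text, links, queue):
    # Only follow links found on pages judged to be about the topic
    if is_on_topic(page_text):
        queue.extend(links)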
Deep Web
• Sites that are difficult for a crawler to find are
collectively referred to as the deep (or hidden)
Web
– much larger than conventional Web
• Three broad categories:
– private sites
• no incoming links, or may require logging in with a valid account
– form results
• sites that can be reached only after entering some data into
a form (e.g., search or application forms)
– scripted pages
• pages that use JavaScript, Flash, or another client-side
language to generate links
Sitemaps
• Sitemaps contain lists of URLs and data about
those URLs, such as modification time and
modification frequency
• Generated by web server administrators
• Tells crawler about pages it might not
otherwise find
• Gives crawler a hint about when to check a
page for changes
• See http://en.wikipedia.org/wiki/Sitemaps
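A sketch of reading URLs, modification times, and change-frequency hints from a sitemap with the standard XML parser (the file name is a placeholder):

import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
tree = ET.parse("sitemap.xml")

for url in tree.getroot().findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)               # the page's URL
    lastmod = url.findtext("sm:lastmod", namespaces=NS)       # modification time
    changefreq = url.findtext("sm:changefreq", namespaces=NS) # hint for revisiting
    print(loc, lastmod, changefreq)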
Distributed Crawling
• Three reasons to use multiple computers for
crawling
– Helps to put the crawler closer to the sites it
crawls
– Reduces the number of sites the crawler has to
remember
– Reduces computing resources required
Document Feeds
• Many documents are published
– created at a fixed time and rarely updated again
– e.g., news articles, blog posts, press releases,
email
• Published documents from a single source can
be ordered in a sequence called a document
feed
– new documents found by examining the end of
the feed
Document Feeds
• Two types:
– A push feed alerts the subscriber to new
documents
– A pull feed requires the subscriber to check
periodically for new documents
• Most common format for pull feeds is called
RSS
– Really Simple Syndication, RDF Site Summary, Rich
Site Summary, or ...
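A sketch of polling an RSS pull feed with the standard library (the feed URL is a placeholder); each <item> element describes one published document:

import xml.etree.ElementTree as ET
from urllib.request import urlopen

with urlopen("http://www.example.com/feed.rss") as response:
    feed = ET.parse(response)

for item in feed.getroot().iter("item"):
    print(item.findtext("pubDate"),   # publication time
          item.findtext("title"),
          item.findtext("link"))      # URL of the new document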
Conversion
• Text is stored in hundreds of incompatible file
formats
– e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF,
PDF
• Other types of files also important
– e.g., PowerPoint, Excel
• Typically use a conversion tool
– converts the document content into a tagged text
format such as HTML or XML
– retains some of the important formatting information
Storing the Documents
• Many reasons to store converted document
text
– saves crawling time when page is not updated
– provides efficient access to text for snippet
generation, information extraction, etc.
• Database systems can provide document
storage for some applications
– web search engines use customized document
storage systems
Storing the Documents
• Requirements for document storage system:
– Random access
• request the content of a document based on its URL
– Compression and large files
• reducing storage requirements and efficient access
– Update
• handling large volumes of new and modified
documents
• adding new anchor text (anchor text is the visible,
clickable text of a hyperlink; the words it contains can
influence how search engines rank the page it points to)
Large Files
• Store many documents in large files, rather
than each document in a separate file (keeps
the number of files small)
– avoids overhead in opening and closing files
– reduces seek time relative to read time
• Compound document formats
– used to store multiple documents in a file
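A sketch of one way to store many documents in a single large file while keeping random access by URL; the file names and index format are assumptions for illustration:

import json

def write_store(docs, data_path="docs.dat", index_path="docs.idx"):
    # docs maps URL -> converted document text; all go into one large file
    index = {}
    with open(data_path, "wb") as out:
        for url, text in docs.items():
            data = text.encode("utf-8")
            index[url] = (out.tell(), len(data))   # remember (offset, length)
            out.write(data)
    with open(index_path, "w") as idx:
        json.dump(index, idx)

def read_doc(url, data_path="docs.dat", index_path="docs.idx"):
    with open(index_path) as idx:
        offset, length = json.load(idx)[url]
    with open(data_path, "rb") as data:
        data.seek(offset)                          # random access by URL
        return data.read(length).decode("utf-8")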
Compression
• Compression techniques exploit redundancy
to make files smaller without losing any of the
content
• Indexes are compressed as well
• Popular algorithms can compress HTML and
XML text by 80%
– e.g., DEFLATE (zip, gzip) and LZW (UNIX compress,
PDF)
– may compress large files in blocks to make access
faster
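A sketch of DEFLATE compression with zlib from the standard library; the repetitive HTML string is made up for illustration:

import zlib

html = b"<html><body>" + b"<p>tropical fish</p>" * 500 + b"</body></html>"

compressed = zlib.compress(html)        # DEFLATE, the algorithm behind zip/gzip
restored = zlib.decompress(compressed)

assert restored == html                 # lossless: no content is lost
print(len(html), len(compressed))       # repetitive markup compresses very well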
Detecting Duplicates
• Duplicate and near-duplicate documents
occur in many situations
– Copies, versions, plagiarism, spam, mirror sites
– 30% of the web pages in a large crawl are exact or
near duplicates of pages in the other 70%
• Duplicates consume significant resources
during crawling, indexing, and search
– Little value to most users
Duplicate Detection
• Exact duplicate detection is relatively easy
• Checksum techniques
– A checksum is a value that is computed based on the
content of the document
• e.g., sum of the bytes in the document file
 T   R   O   P   I   C   A   L  (sp)  F   I   S   H   Sum
 80  65  33  38  12  22  10  45   15  17  12  69  11   429
– Possible for files with different text to have same
checksum
• Functions such as a cyclic redundancy check
(CRC) have been developed that consider the
positions of the bytes
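A sketch contrasting a byte-sum checksum with CRC-32 (zlib.crc32); because the sum ignores the order of the bytes, two different texts can share a checksum:

import zlib

def byte_sum_checksum(text):
    # Sum of the byte values in the document (order-insensitive)
    return sum(text.encode("utf-8"))

print(byte_sum_checksum("tropical fish"))    # same bytes in a different order...
print(byte_sum_checksum("fish tropical"))    # ...give the same sum
print(zlib.crc32(b"tropical fish"))          # CRC takes byte positions into account
print(zlib.crc32(b"fish tropical"))          # so these two values differ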
Near-Duplicate Detection
• More challenging task
– Are web pages with the same text content but
different advertising or formatting near-duplicates?
• A near-duplicate document is defined using a
threshold value for some similarity measure
between pairs of documents
– e.g., document D1 is a near-duplicate of
document D2 if more than 90% of the words in
the documents are the same
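A sketch of one possible word-overlap measure with an assumed 90% threshold (the slides do not fix a particular similarity function):

def near_duplicate(doc1, doc2, threshold=0.9):
    words1 = set(doc1.lower().split())
    words2 = set(doc2.lower().split())
    # Fraction of distinct words the two documents have in common
    overlap = len(words1 & words2) / max(len(words1 | words2), 1)
    return overlap > threshold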
Fingerprints
Fingerprint Example
• [Figure: a sample passage is split into 3-word
n-grams (16 groups); hashing each n-gram gives
16 values, from which the document's fingerprint
is built]
• Hashing: transformation of a string of characters
into a usually shorter fixed-length value or key
that represents the original string
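A sketch of the n-gram fingerprinting idea with n = 3: hash every 3-word n-gram and keep a deterministic subset of the hash values as the fingerprint. The "hash value divisible by 4" selection rule is an assumption for illustration:

import zlib

def fingerprint(text, n=3, modulus=4):
    words = text.lower().split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    hashes = [zlib.crc32(g.encode("utf-8")) for g in ngrams]
    # Keep only hash values divisible by the modulus as the fingerprint
    return {h for h in hashes if h % modulus == 0}

# Documents can then be compared by the overlap of their fingerprint sets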
Hash Value
The contents of a file are processed through a
cryptographic algorithm, and a unique numerical
value (the hash value) is produced that identifies
the contents of the file.
Simhash
• Similarity comparisons using word-based
representations more effective at finding
near-duplicates
– Problem is efficiency
• Simhash combines the advantages of the
word-based similarity measures with the
efficiency of fingerprints based on hashing
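A sketch of one common simhash formulation (word-frequency weights, 32 bits, CRC-32 as the word hash; all of these are assumptions for illustration):

import zlib
from collections import Counter

def simhash(text, bits=32):
    weights = Counter(text.lower().split())        # word weights = frequencies
    v = [0] * bits
    for word, weight in weights.items():
        h = zlib.crc32(word.encode("utf-8"))
        for i in range(bits):
            # Add the weight where bit i of the hash is 1, subtract it where it is 0
            v[i] += weight if (h >> i) & 1 else -weight
    # Keep one bit per position: 1 where the accumulated total is positive
    return sum(1 << i for i in range(bits) if v[i] > 0)

# Near-duplicate documents agree on most bits (small Hamming distance)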
Removing Noise
• Many web pages contain text, links, and
pictures that are not directly related to the
main content of the page
• This additional material is mostly noise that
could negatively affect the ranking of the page
• Techniques have been developed to detect
the content blocks in a web page
– Non-content material is either ignored or reduced
in importance in the indexing process
END

Try Tutorial 3
