Topic 3 W3 Crawls and Feeds - SDR - March2023

Search Engines

Information Retrieval in Practice

All slides ©Addison Wesley, 2008


Web Crawler
• Finds and downloads web pages automatically
– provides the collection for searching
• Web is huge and constantly growing
• Web is not under the control of search engine
providers
• Web pages are constantly changing
• Crawlers also used for other types of data
Retrieving Web Pages
• Every page has a unique uniform resource
locator (URL)
• Web pages are stored on web servers that use
HTTP to exchange information with client
software
• e.g., http://www.example.com/index.html identifies a
page stored on the web server www.example.com
Retrieving Web Pages
• Web crawler connects to a domain name
system (DNS) server
• DNS server translates the hostname into an
internet protocol (IP) address
• Crawler then attempts to connect to server
host using specific port
• After connection, crawler sends an HTTP
request to the web server to request a page
– usually a GET request
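A minimal sketch of this retrieval step in Python using only the standard library; the hostname and URL are placeholders, not taken from the slides:

import socket
from urllib.request import urlopen

# DNS lookup: translate the hostname into an IP address
ip_address = socket.gethostbyname("www.example.com")

# HTTP GET request: urlopen connects to the server (port 80 for
# http) and asks it to return the page
with urlopen("http://www.example.com/") as response:
    page = response.read()

print(ip_address, len(page))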
Crawling the Web
Web Crawler
• Starts with a set of seeds, which are a set of URLs
given to it as parameters
• Seeds are added to a URL request queue
• Crawler starts fetching pages from the request
queue
• Downloaded pages are parsed (processed) to
identify link tags that might contain other
useful URLs to fetch
• New URLs are added to the crawler's request queue
• Crawling continues until no new URLs are found
or the disk is full
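A simplified sketch of this crawl loop, assuming a single thread and a regular-expression link extractor (a real crawler would also apply politeness rules, robots.txt checks, and URL normalization):

import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seeds, max_pages=100):
    queue = deque(seeds)       # URL request queue seeded with the start URLs
    seen = set(seeds)
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except OSError:
            continue           # skip pages that fail to download
        # Parse the page for link tags and queue any new URLs
        for href in re.findall(r'href="([^"]+)"', html):
            new_url = urljoin(url, href)
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)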
Web Crawling
• Web crawlers spend a lot of time waiting for
responses to requests
• To reduce this inefficiency, web crawlers use
threads and fetch hundreds of pages at once
• Crawlers could potentially flood sites with
requests for pages
• To avoid this problem, web crawlers use
politeness policies
– e.g., delay between requests to same web server
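A sketch of a per-host politeness delay; the 10-second value is an assumed setting, not one given in the slides:

import time
from urllib.parse import urlparse

POLITENESS_DELAY = 10.0   # assumed minimum seconds between requests to one host
last_request = {}         # host -> time of the previous request

def wait_for_politeness(url):
    host = urlparse(url).netloc
    elapsed = time.time() - last_request.get(host, 0.0)
    if elapsed < POLITENESS_DELAY:
        time.sleep(POLITENESS_DELAY - elapsed)   # wait out the remaining delay
    last_request[host] = time.time()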
Controlling Crawling
• Even crawling a site slowly will anger some
web server administrators, who object to any
copying of their data
• A robots.txt file can be used to control crawlers
– e.g., it can disallow certain paths and list the
site's sitemap
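Python's standard library can read and apply robots.txt rules; a sketch (the site and user agent are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.example.com/robots.txt")
rp.read()

# Ask whether our crawler may fetch a particular page
if rp.can_fetch("MyCrawler", "http://www.example.com/private/data.html"):
    print("allowed")
else:
    print("disallowed by robots.txt")

# Sitemap URLs listed in robots.txt, if any (Python 3.8+)
print(rp.site_maps())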
Freshness
• Web pages are constantly being added,
deleted, and modified
• Web crawler must continually revisit pages it
has already crawled to see if they have
changed in order to maintain the freshness of
the document collection
– stale copies no longer reflect the real contents of
the web pages
Freshness
• HTTP protocol has a special request type
called HEAD that makes it easy to check for
page changes
– returns information about page, not page itself
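A sketch of this check with a HEAD request (the URL is a placeholder); the response headers describe the page without transferring its body:

from urllib.request import Request, urlopen

req = Request("http://www.example.com/index.html", method="HEAD")
with urlopen(req) as response:
    print(response.headers.get("Last-Modified"))   # when the page last changed
    print(response.headers.get("Content-Length"))  # size of the page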
Freshness
• Not possible to constantly check all pages
– must check important pages and pages that
change frequently (e.g., news sites, government sites)
• Freshness is the proportion of crawled pages
whose stored copies are still fresh (up to date)
Focused Crawling
• Attempts to download only those pages that are
about a particular topic
– used by vertical search applications (e.g., Mocavo.com
– discover family history; Yelp.com – find local
businesses in SF; Trulia.com – property search)
• Rely on the fact that pages about a topic tend to
have links to other pages on the same topic
– popular pages for a topic are typically used as seeds
(e.g., WebMD for health topics)
• Crawler uses text classifier to decide whether a
page is on topic (or just a combination of many
topics)
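A sketch of how the classifier fits into the crawl loop; is_on_topic is a stand-in for a trained text classifier, and the keyword test and topic words are purely illustrative:

TOPIC_WORDS = {"genealogy", "ancestry", "family", "records"}   # assumed topic vocabulary

def is_on_topic(page_text, threshold=2):
    # Stand-in for a real classifier: count topic words on the page
    words = set(page_text.lower().split())
    return len(words & TOPIC_WORDS) >= threshold

def enqueue_links_if_relevant(page_text, links, queue):
    # Only follow links found on pages judged to be about the topic
    if is_on_topic(page_text):
        queue.extend(links)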
Deep Web
• Sites that are difficult for a crawler to find are
collectively referred to as the deep (or hidden)
Web
– much larger than conventional Web
• Three broad categories:
– private sites
• no incoming links, or may require logging in with a valid account
– form results
• sites that can be reached only after entering some data into
a form (e.g., search or application forms)
– scripted pages
• pages that use JavaScript, Flash, or another client-side
language to generate links
Sitemaps
• Sitemaps contain lists of URLs and data about
those URLs, such as modification time and
modification frequency
• Generated by web server administrators
• Tells crawler about pages it might not
otherwise find
• Gives crawler a hint about when to check a
page for changes
• See http://en.wikipedia.org/wiki/Sitemaps
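A sketch of reading URLs, modification times, and change-frequency hints from a sitemap with the standard XML parser (the file name is a placeholder):

import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
tree = ET.parse("sitemap.xml")

for url in tree.getroot().findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)               # the page's URL
    lastmod = url.findtext("sm:lastmod", namespaces=NS)       # modification time
    changefreq = url.findtext("sm:changefreq", namespaces=NS) # hint for revisiting
    print(loc, lastmod, changefreq)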
Distributed Crawling
• Three reasons to use multiple computers for
crawling
– Helps to put the crawler closer to the sites it
crawls
– Reduces the number of sites the crawler has to
remember
– Reduces computing resources required
Document Feeds
• Many documents are published
– created at a fixed time and rarely updated again
– e.g., news articles, blog posts, press releases,
email
• Published documents from a single source can
be ordered in a sequence called a document
feed
– new documents found by examining the end of
the feed
Document Feeds
• Two types:
– A push feed alerts the subscriber to new
documents
– A pull feed requires the subscriber to check
periodically for new documents
• Most common format for pull feeds is called
RSS
– Really Simple Syndication, RDF Site Summary, Rich
Site Summary, or ...
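A sketch of polling an RSS pull feed with the standard library (the feed URL is a placeholder); each <item> element describes one published document:

import xml.etree.ElementTree as ET
from urllib.request import urlopen

with urlopen("http://www.example.com/feed.rss") as response:
    feed = ET.parse(response)

for item in feed.getroot().iter("item"):
    print(item.findtext("pubDate"),   # publication time
          item.findtext("title"),
          item.findtext("link"))      # URL of the new document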
Conversion
• Text is stored in hundreds of incompatible file
formats
– e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF,
PDF
• Other types of files also important
– e.g., PowerPoint, Excel
• Typically use a conversion tool
– converts the document content into a tagged text
format such as HTML or XML
– retains some of the important formatting information
Storing the Documents
• Many reasons to store converted document
text
– saves crawling time when page is not updated
– provides efficient access to text for snippet
generation, information extraction, etc.
• Database systems can provide document
storage for some applications
– web search engines use customized document
storage systems
Storing the Documents
• Requirements for document storage system:
– Random access
• request the content of a document based on its URL
– Compression and large files
• reducing storage requirements and efficient access
– Update
• handling large volumes of new and modified
documents
• adding new anchor text (anchor text is the visible,
clickable text of a hyperlink; the words it contains can
influence how search engines rank the page it points to)
Large Files
• Store many documents in large files, rather
than each document in a separate file (keeps
the number of files small)
– avoids overhead in opening and closing files
– reduces seek time relative to read time
• Compound document formats
– used to store multiple documents in a file
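A sketch of one way to store many documents in a single large file while keeping random access by URL; the file names and index format are assumptions for illustration:

import json

def write_store(docs, data_path="docs.dat", index_path="docs.idx"):
    # docs maps URL -> converted document text; all go into one large file
    index = {}
    with open(data_path, "wb") as out:
        for url, text in docs.items():
            data = text.encode("utf-8")
            index[url] = (out.tell(), len(data))   # remember (offset, length)
            out.write(data)
    with open(index_path, "w") as idx:
        json.dump(index, idx)

def read_doc(url, data_path="docs.dat", index_path="docs.idx"):
    with open(index_path) as idx:
        offset, length = json.load(idx)[url]
    with open(data_path, "rb") as data:
        data.seek(offset)                          # random access by URL
        return data.read(length).decode("utf-8")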
Compression
• Compression techniques exploit redundancy
to make files smaller without losing any of the
content
• Indexes are compressed as well
• Popular algorithms can compress HTML and
XML text by 80%
– e.g., DEFLATE (zip, gzip) and LZW (UNIX compress,
PDF)
– may compress large files in blocks to make access
faster
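A sketch of DEFLATE compression with zlib from the standard library; the repetitive HTML string is made up for illustration:

import zlib

html = b"<html><body>" + b"<p>tropical fish</p>" * 500 + b"</body></html>"

compressed = zlib.compress(html)        # DEFLATE, the algorithm behind zip/gzip
restored = zlib.decompress(compressed)

assert restored == html                 # lossless: no content is lost
print(len(html), len(compressed))       # repetitive markup compresses very well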
Detecting Duplicates
• Duplicate and near-duplicate documents
occur in many situations
– Copies, versions, plagiarism, spam, mirror sites
– 30% of the web pages in a large crawl are exact or
near duplicates of pages in the other 70%
• Duplicates consume significant resources
during crawling, indexing, and search
– Little value to most users
Duplicate Detection
• Exact duplicate detection is relatively easy
• Checksum techniques
– A checksum is a value that is computed based on the
content of the document
• e.g., sum of the bytes in the document file
 T   R   O   P   I   C   A   L  (sp)  F   I   S   H   Sum
 80  65  33  38  12  22  10  45   15  17  12  69  11   429
– Possible for files with different text to have same
checksum
• Functions such as a cyclic redundancy check
(CRC) have been developed that consider the
positions of the bytes
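A sketch contrasting a byte-sum checksum with CRC-32 (zlib.crc32); because the sum ignores the order of the bytes, two different texts can share a checksum:

import zlib

def byte_sum_checksum(text):
    # Sum of the byte values in the document (order-insensitive)
    return sum(text.encode("utf-8"))

print(byte_sum_checksum("tropical fish"))    # same bytes in a different order...
print(byte_sum_checksum("fish tropical"))    # ...give the same sum
print(zlib.crc32(b"tropical fish"))          # CRC takes byte positions into account
print(zlib.crc32(b"fish tropical"))          # so these two values differ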
Near-Duplicate Detection
• More challenging task
– Are web pages with the same text content but
different advertising or formatting near-duplicates?
• A near-duplicate document is defined using a
threshold value for some similarity measure
between pairs of documents
– e.g., document D1 is a near-duplicate of
document D2 if more than 90% of the words in
the documents are the same
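A sketch of one possible word-overlap measure with an assumed 90% threshold (the slides do not fix a particular similarity function):

def near_duplicate(doc1, doc2, threshold=0.9):
    words1 = set(doc1.lower().split())
    words2 = set(doc2.lower().split())
    # Fraction of distinct words the two documents have in common
    overlap = len(words1 & words2) / max(len(words1 | words2), 1)
    return overlap > threshold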
Fingerprints
Fingerprint Example
• [Figure: a sample passage is split into 3-word
n-grams (16 groups); hashing each n-gram gives
16 values, from which the document's fingerprint
is built]
• Hashing: transformation of a string of characters
into a usually shorter fixed-length value or key
that represents the original string
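A sketch of the n-gram fingerprinting idea with n = 3: hash every 3-word n-gram and keep a deterministic subset of the hash values as the fingerprint. The "hash value divisible by 4" selection rule is an assumption for illustration:

import zlib

def fingerprint(text, n=3, modulus=4):
    words = text.lower().split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    hashes = [zlib.crc32(g.encode("utf-8")) for g in ngrams]
    # Keep only hash values divisible by the modulus as the fingerprint
    return {h for h in hashes if h % modulus == 0}

# Documents can then be compared by the overlap of their fingerprint sets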
Hash Value
The contents of a file are processed through a
cryptographic algorithm, and a unique numerical
value (the hash value) is produced that identifies
the contents of the file.
Simhash
• Similarity comparisons using word-based
representations more effective at finding
near-duplicates
– Problem is efficiency
• Simhash combines the advantages of the
word-based similarity measures with the
efficiency of fingerprints based on hashing
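A sketch of one common simhash formulation (word-frequency weights, 32 bits, CRC-32 as the word hash; all of these are assumptions for illustration):

import zlib
from collections import Counter

def simhash(text, bits=32):
    weights = Counter(text.lower().split())        # word weights = frequencies
    v = [0] * bits
    for word, weight in weights.items():
        h = zlib.crc32(word.encode("utf-8"))
        for i in range(bits):
            # Add the weight where bit i of the hash is 1, subtract it where it is 0
            v[i] += weight if (h >> i) & 1 else -weight
    # Keep one bit per position: 1 where the accumulated total is positive
    return sum(1 << i for i in range(bits) if v[i] > 0)

# Near-duplicate documents agree on most bits (small Hamming distance)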
Removing Noise
• Many web pages contain text, links, and
pictures that are not directly related to the
main content of the page
• This additional material is mostly noise that
could negatively affect the ranking of the page
• Techniques have been developed to detect
the content blocks in a web page
– Non-content material is either ignored or reduced
in importance in the indexing process
END

Try Tutorial 3
