Scraping
Seminar
Submitted to: Prof. Anoop Patel
Submitted By: Gaurav Arora
11610550
What are we talking about?
Visit https://www.website-name.com/robots.txt
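As a minimal sketch of how a crawler might consult robots.txt before fetching a page, the example below uses Python's standard `urllib.robotparser`. The robots.txt content and the user-agent name `my-crawler` are assumptions made up for illustration; a real crawler would download the file from the site instead of hard-coding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch); a real
# crawler would download it from https://www.website-name.com/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a robots.txt parser from the file's text, without network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-crawler" is a made-up user-agent name.
allowed = parser.can_fetch("my-crawler", "https://www.website-name.com/index.html")
blocked = parser.can_fetch("my-crawler", "https://www.website-name.com/private/data.html")

print(allowed)  # path is not covered by any Disallow rule
print(blocked)  # path falls under Disallow: /private/
```

To read the live file instead, `RobotFileParser` also offers `set_url(...)` followed by `read()`.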
[Figure: basic crawler architecture. Components shown: www, Fetch, Parse, Content Seen?, URL Filter, Dup URL Elim, URL Frontier]
Terminologies:
URL Frontier: contains the URLs yet to be fetched in the
current crawl. Initially, a seed set is stored in the URL Frontier,
and the crawler begins by taking a URL from the seed set.
DNS: Domain Name System resolution. Looks up the IP address
for a domain name.
Fetch: generally uses the HTTP protocol to fetch the page at the URL.
Parse: the page is parsed. Text, media (images, videos, etc.),
and links are extracted.
Content Seen: tests whether a web page with the same
content has already been seen at another URL. This requires
a way to compute a fingerprint of a web page.
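The terms above can be tied together in a small crawl loop. This is only a sketch under stated assumptions: the `fetch` callable, the in-memory `FAKE_WEB` pages, and all function names are hypothetical, and the fingerprint is a plain SHA-256 hash, which only catches exact duplicates (near-duplicate detection would need a different fingerprint). DNS resolution is not shown because an HTTP library would normally handle it inside the fetch step.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Parse step: collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fingerprint(html: str) -> str:
    """Content Seen step: exact-match fingerprint (SHA-256 of the page text)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def crawl(seeds, fetch, max_pages=100):
    """Crawl from a seed set; fetch is any callable url -> HTML string or None."""
    frontier = deque(seeds)    # URL Frontier: URLs yet to be fetched
    seen_urls = set(seeds)     # URLs already added to the frontier
    seen_content = set()       # fingerprints of pages already processed
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)      # Fetch step (injected so the sketch stays offline)
        if html is None:
            continue
        fp = fingerprint(html)
        if fp in seen_content:  # same content already seen at another URL
            continue
        seen_content.add(fp)
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)
            if absolute not in seen_urls:
                seen_urls.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<p>same text</p> <a href="/">home</a>',
    "http://example.test/b": '<p>same text</p> <a href="/">home</a>',
}

visited = crawl(["http://example.test/"], FAKE_WEB.get)
print(visited)
```

Here `/a` and `/b` have identical content, so the second one is skipped by the content-seen test even though its URL is new.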
Web Crawler? Web Scraping?
• The parse phase of a crawler is essentially web scraping:
extracting data from the fetched page.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
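A minimal sketch of scraping with Beautiful Soup might look like the following. It assumes the `beautifulsoup4` package is installed, and the HTML snippet is made up for illustration so that no network request is needed. It is not the code from the original slide.

```python
from bs4 import BeautifulSoup

# Hypothetical page content (an assumption for this sketch).
SAMPLE_HTML = """
<html><head><title>Sample page</title></head>
<body>
  <h1>Products</h1>
  <ul>
    <li><a href="/item/1">First item</a></li>
    <li><a href="/item/2">Second item</a></li>
  </ul>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra parser package is needed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Extract the page title and every link's text and target.
title = soup.title.string
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(title)
print(links)
```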
Python Code:
Output:
Thank you!
• Questions?