
WEB SCRAPING

Seminar
Submitted to: Prof. Anoop Patel
Submitted By: Gaurav Arora
11610550
What are we talking about?

• Web Scraping is the term used for the various techniques by which we fetch any kind of data from other websites.

• In simple words, extracting data from websites is web scraping.

• Newer definitions even include listening to data feeds from web servers.
Umm, I’m not sure what data fetching is.

• Fetching is the downloading of a web page; the browser does this and then converts the HTML into a readable format.
Is it the only way?
Web scraping is definitely not the only possible way to extract information from a webpage.
The following approaches are also possible:

• Manual copying and pasting
• APIs
• Homemade tools
• Scraping frameworks like Scrapy
What’s legal and what’s not?
Some websites provide their own APIs for extracting data, which remains the most secure and most clearly legal way of getting it.

Not every website provides an API, though; in that case we can use a homemade scraping tool or a scraping framework like Scrapy.
How can I be sure I’m not violating any rules?

Suppose you’re thinking about scraping data from a particular website’s pages.

Visit https://www.website-name.com/robots.txt

This file spells out what the site allows crawlers to access and what it does not. A programmatic check is sketched below.
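As a minimal sketch (assuming Python’s standard-library robotparser; the Wikipedia URLs are just illustrative targets), the same check can be done programmatically:

from urllib import robotparser

# Point the parser at the site's robots.txt file.
rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

# can_fetch() reports whether the given user agent may fetch a URL.
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Web_scraping"))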
Let’s see what Wikipedia allows.
How does this scraping thingy work?
• The basic concept behind web scraping is the DOM model of the website.
• DOM model: it defines the logical structure of documents and the way a document is accessed and manipulated, using tags, classes, ids, etc.
A General DOM:
HTML – The structure of a webpage
• Web browsers use HTML (HyperText Markup Language) to display webpages.
• HTML is composed of elements (tags). An element consists of a start tag <element> and a closing tag </element>.
• Ids are unique on a page. There will only be one element with the id “awesome”.
  <element id=“awesome”></element>
• Classes are used for categorizing elements. There can be many elements with the class “not-as-cool”.
  <element class=“not-as-cool”></element>
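To make the id/class distinction concrete, here is a small illustrative sketch using BeautifulSoup (introduced later in these slides); the HTML fragment and names are invented for the example:

from bs4 import BeautifulSoup

html = """
<div id="awesome">only one of these</div>
<p class="not-as-cool">first</p>
<p class="not-as-cool">second</p>
"""
soup = BeautifulSoup(html, "html.parser")

# An id matches at most one element on the page.
print(soup.find(id="awesome").text)

# A class can match many elements.
for p in soup.find_all(class_="not-as-cool"):
    print(p.text)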
Navigating HTML
• We can navigate through HTML using a combination of tags, ids, and classes.
• Using CSS selectors: http://www.w3schools.com/cssref/css_selectors.asp
• To find the links in the main navigation:
  nav#main-nav > ul > li
• To get the featured image:
  div#main-content > div.featured-image > img[src]
A BeautifulSoup sketch using these selectors follows below.
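For illustration only (the page structure below is hypothetical and assumed), BeautifulSoup’s select() and select_one() accept exactly this kind of CSS selector:

from bs4 import BeautifulSoup

html = """
<nav id="main-nav"><ul>
  <li><a href="/home">Home</a></li>
  <li><a href="/about">About</a></li>
</ul></nav>
<div id="main-content">
  <div class="featured-image"><img src="/img/banner.png"></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Links in the main navigation.
for li in soup.select("nav#main-nav > ul > li"):
    print(li.a["href"])

# The featured image's src attribute.
img = soup.select_one("div#main-content > div.featured-image > img[src]")
print(img["src"])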
Nah, I’ll just copy and paste data!

• The main utility of web scraping shows up in web crawlers.
• Most well-known search engines work by crawling the world wide web.
• What’s a crawler / crawling? Glad you asked!
WHAT IS A SEARCH ENGINE?
• In essence, a search engine is a queryable database which collects information from web
pages on the Internet, indexes that information, and stores it in a huge database from which
it can be quickly searched.
ARCHITECTURE OF A SEARCH ENGINE
WHAT IS A WEB CRAWLER?
• Software programs that traverse the world wide web information space by following
hypertext links and retrieving web documents via the standard HTTP protocol.
ARCHITECTURE OF A WEB CRAWLER

[Block diagram: URLs from the URL Frontier go through DNS resolution and Fetch against the www; fetched pages are Parsed; parsed content is checked against doc fingerprints (“Content Seen?”), and extracted links pass through a URL Filter (robots templates) and duplicate-URL elimination against the URL set before re-entering the URL Frontier.]
Terminologies:
 URL Frontier: contains the URLs yet to be fetched in the current crawl. At first, a seed set
is stored in the URL Frontier, and the crawler begins by taking a URL from that seed set.
 DNS: domain name service resolution; look up the IP address for a domain name.
 Fetch: generally uses the HTTP protocol to fetch the URL.
 Parse: the page is parsed; text (images, videos, etc.) and links are extracted.
 Content Seen?: test whether a web page with the same content has already been seen at
another URL. This requires a way to compute a fingerprint of a web page.
A simplified sketch of this crawl loop follows below.
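Tying these terms together, here is a rough, simplified crawl-loop sketch (my own illustration using requests and BeautifulSoup; a real crawler would also honor robots.txt, add politeness delays, and handle errors):

from collections import deque
from urllib.parse import urljoin
import hashlib

import requests
from bs4 import BeautifulSoup

MAX_PAGES = 20                                 # keep the sketch bounded
frontier = deque(["https://example.com/"])     # URL Frontier seeded with a seed set
seen_urls = set(frontier)                      # URL set, for duplicate-URL elimination
fingerprints = set()                           # doc fingerprints, for the "Content Seen?" test

while frontier and len(fingerprints) < MAX_PAGES:
    url = frontier.popleft()
    page = requests.get(url, timeout=10)            # Fetch (DNS is handled by the library)

    fp = hashlib.sha1(page.content).hexdigest()     # fingerprint of the page content
    if fp in fingerprints:                          # Content Seen? skip duplicate content
        continue
    fingerprints.add(fp)

    soup = BeautifulSoup(page.text, "html.parser")  # Parse: extract text and links
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])              # resolve relative links
        if link not in seen_urls:                   # Dup URL elimination
            seen_urls.add(link)
            frontier.append(link)                   # back into the URL Frontier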
Web Crawler? Web Scraping?
• The parse phase of the crawler is more or less scraping the web page for data.

• It’s actually not difficult to parse data out of a web page; let’s see how it works.
How do I make a scraping tool?
• The first step is to study the website’s structure well: how elements are linked together.
  For example, if I want to fetch all the links on a page, the <a> tag is what all of the
  links have in common.
  Similarly, a particular set of headings may share the same class.
Found the common link!
• So, now that we know what scraping is, let’s start making some basic scraping tools in
Python using the requests and BeautifulSoup libraries.

• Task: get all the hyperlinks from the homepage of GeeksforGeeks.
Web Scraping using Python and libraries

To make a home-made custom tool for scraping data, we use two Python libraries:

• Requests: to fetch webpages using HTTP requests.

• BeautifulSoup: a library that helps parse and clean up the data fetched by the HTTP request.
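As a quick illustration of the first step (the URL is my guess at the GeeksforGeeks homepage named in these slides), Requests fetches the raw HTML of a page:

import requests

# Fetch the page over HTTP; .text holds the raw HTML.
response = requests.get("https://www.geeksforgeeks.org/")
print(response.status_code)   # 200 on success
print(response.text[:500])    # first few hundred characters of HTML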
BeautifulSoup
• Visit the link for the documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Python Code:
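A minimal sketch of the described task (collect all hyperlinks from the GeeksforGeeks homepage), assuming the Requests and BeautifulSoup usage above; this is an illustrative version rather than the slide’s exact code:

import requests
from bs4 import BeautifulSoup

# Fetch the homepage (the URL is an assumption for the site named in the slides).
response = requests.get("https://www.geeksforgeeks.org/")

# Parse the HTML and print the href of every <a> tag.
soup = BeautifulSoup(response.text, "html.parser")
for a in soup.find_all("a", href=True):
    print(a["href"])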
Output:
Thank you!

• Questions?
