
WEB SCRAPING SERIES

Web scraping with Scrapy: Practical Understanding
Hands-on with Scrapy

Karthikeyan P
Jul 31 · 11 min read

Photo by Ilya Pavlov on Unsplash

With all the theoretical aspects of using Scrapy dealt with in part 1, it's now time
for some practical examples. I shall put these theoretical aspects into examples of
increasing complexity. There are three examples:

An example demonstrating single request & response by extracting a city's weather from a weather site

An example demonstrating multiple requests & responses by extracting book details from a dummy online book store

An example demonstrating image scraping

You can download these examples from my GitHub page. This is the second part of a
4-part tutorial series on web scraping using Scrapy and Selenium. The other parts can be
found at:

Part 1: Web scraping with Scrapy: Theoretical Understanding

Part 3: Web scraping with Selenium

Part 4: Web scraping with Selenium & Scrapy

Important note:
Before you try to scrape any website, please go through its robots.txt file. It can be
accessed at an address like www.google.com/robots.txt. There, you will see a list of the pages that are
allowed and disallowed for scraping on Google's website. You can access only those pages that fall under
User-agent: * and those that follow Allow: .
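
Scrapy can also enforce this for you. As a small aside of my own (not part of the article's walkthrough): newly generated Scrapy projects ship with the ROBOTSTXT_OBEY setting enabled in settings.py, which makes the framework fetch a site's robots.txt and drop disallowed requests automatically.

# settings.py (a minimal sketch; this setting already exists in freshly generated projects)
ROBOTSTXT_OBEY = True  # drop requests that robots.txt disallows for our user agent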

. . .

Example 1 — Handling single request & response by extracting a city's weather from a weather site

Our goal for this example is to extract today's weather report for the city of Chennai from
weather.com. The extracted data must contain temperature, air quality and
condition/description. You are free to choose your own city; just provide the URL for your city
in the spider's code. As pointed out earlier, the site allows data to be scraped provided
there is a crawl delay of no less than 10 seconds, i.e. you have to wait at least 10 seconds
before requesting another URL from weather.com. This can be found in the site's
robots.txt.

User-agent: *
# Crawl-delay: 10

I have created a new Scrapy project using the scrapy startproject command and created a
basic spider using

scrapy genspider -t basic weather_spider weather.com

The first task when starting to code is to adhere to the site's policy. To honour
weather.com's crawl delay, we need to add the following line to our Scrapy
project's settings.py file.

DOWNLOAD_DELAY = 10

This line makes the spiders in our project wait 10 seconds before making a new URL
request. We can now start to code our spider.

As shown earlier, genspider generates the template code. I have made some modifications to
that code.

import scrapy
import re
from ..items import WeatherItem


class WeatherSpiderSpider(scrapy.Spider):
    name = "weather_spider"
    allowed_domains = ["weather.com"]

    def start_requests(self):
        # Weather.com URL for Chennai's weather
        urls = [
            "https://weather.com/en-IN/weather/today/l/bf01d09009561812f3f95abece23d16e123d8c08fd0b8ec7ffc9215c0154913c"
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_url)

    def parse_url(self, response):
        # Extracting city, temperature, air quality and condition from the response using XPath
        city = response.xpath('//h1[contains(@class,"location")]/text()').get()
        temp = response.xpath('//span[@data-testid="TemperatureValue"]/text()').get()
        air_quality = response.xpath('//span[@data-testid="AirQualityCategory"]/text()').get()
        cond = response.xpath('//div[@data-testid="wxPhrase"]/text()').get()

        temp = re.match(r"(\d+)", temp).group(1) + " C"  # Removing the degree symbol and adding C
        city = re.match(r"^(.*)(?: Weather)", city).group(1)  # Removing 'Weather' from location

        # Yielding the extracted data as an Item object. You may also yield a dictionary
        item = WeatherItem()
        item["city"] = city
        item["temp"] = temp
        item["air_quality"] = air_quality
        item["cond"] = cond
        yield item

I think the code for this example is self-explanatory. I will, however, explain the flow. I
hope you can remember the overall flow diagram of Scrapy from the last part. I wish to
be in control of making requests, so I use start_requests() instead of start_urls. Inside
start_requests(), the URL for Chennai's weather page is specified. If you wish to
change it to your preferred city or add more cities, feel free to do so. For every URL in the
list of URLs, a request is generated and yielded. All of these requests reach the
Scheduler, which dispatches them whenever the Engine asks for a request.
After the webpage corresponding to a request is downloaded by the Downloader, the
response is sent back to the Engine, which directs it to the respective spider. In this case,
WeatherSpider receives the response and calls the callback function parse_url(). Inside
this function, I have used XPath to extract the required data from the response.
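
If you want to verify such XPath expressions before wiring them into the spider, Scrapy's interactive shell is handy. This is a small aside of my own, not part of the article's walkthrough, and the selectors may need adjusting whenever weather.com changes its markup:

scrapy shell "https://weather.com/en-IN/weather/today/l/bf01d09009561812f3f95abece23d16e123d8c08fd0b8ec7ffc9215c0154913c"

# then, inside the shell, try the same calls used in parse_url() one at a time:
response.xpath('//h1[contains(@class,"location")]/text()').get()
response.xpath('//span[@data-testid="TemperatureValue"]/text()').get()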

You may have followed everything up to this point, but the next part of the code may be new to you, since it
has not yet been explained. I have made use of Scrapy Items. These are Python objects
that define key-value pairs. You can refer to this link to explore more about Items. If you
do not wish to make use of Items, you can create a dictionary and yield it instead, as in the short sketch below.
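
A minimal sketch of that dictionary-based alternative (my own illustration, not from the article); the end of parse_url() would simply become:

# Yield a plain dict with the same keys instead of building a WeatherItem
yield {
    "city": city,
    "temp": temp,
    "air_quality": air_quality,
    "cond": cond,
}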
A question may arise: where do we define these so-called Items? Allow me to refresh your
memory. While creating a new project, we saw some files being created by Scrapy.
Remember?

weather/
├── scrapy.cfg
└── weather
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── __pycache__
    ├── settings.py
    └── spiders
        ├── WeatherSpider.py
        ├── __init__.py
        └── __pycache__

If you look patiently along this tree, you may notice a file named items.py. It is in this
file that you need to define the Item objects.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class WeatherItem(scrapy.Item):
    city = scrapy.Field()
    temp = scrapy.Field()
    air_quality = scrapy.Field()
    cond = scrapy.Field()

Scrapy would have already created the class; all you need to do is define the key-value pairs. In
this example, since we need the city name, temperature, air quality and condition, I have
defined 4 fields. You can define any number of fields as required by your project.
When you run the project using the following command, a JSON file containing the
scraped items would be created.

scrapy crawl weather_spider -o output.json

The contents would look like this:

output.json
-----------

[
  {"city": "Chennai, Tamil Nadu", "temp": "31 C", "air_quality": "Good", "cond": "Cloudy"}
]

Hurray! You have successfully executed a simple Scrapy project handling a single
request and response.

. . .

Example 2 — Handling multiple requests & responses by extracting book details from a dummy online book store

Our goal for this example is to scrape the details of all the books (1,000 to be exact) from
the website books.toscrape.com. Do not worry about robots.txt here; this site is specifically
designed and hosted for the purpose of practising web scraping, so you are in the clear.
The website is designed in such a way that it has 50 pages, with each page listing 20
books. You cannot extract book details from the listing pages; you have to navigate to
each individual book's webpage to extract the required details. This is a scenario that
requires crawling multiple webpages, so I will be using a Crawl Spider.
As in the previous example, I have created a new project with scrapy startproject and a
crawling spider using

scrapy genspider -t crawl crawl_spider books.toscrape.com

For this example, I will be extracting the title of the book, its price, rating and availability.
The items.py file would look like this.

class BookstoscrapeItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    availability = scrapy.Field()

Now that everything needed for the project is ready, let us look into crawl_spider.py .

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import BookstoscrapeItem


class CrawlSpiderSpider(CrawlSpider):
    name = "crawl_spider"
    allowed_domains = ["books.toscrape.com"]
    # start_urls = ["http://books.toscrape.com/"]  # when trying to use this, comment out start_requests()

    rules = (Rule(LinkExtractor(allow=r"catalogue/"), callback="parse_books", follow=True),)

    def start_requests(self):
        url = "http://books.toscrape.com/"
        yield scrapy.Request(url)

    def parse_books(self, response):
        """
        Filtering out pages other than books' pages to avoid getting a "NotFound" error,
        because other pages would not have any 'div' tag with the
        attribute 'class="col-sm-6 product_main"'.
        """
        if response.xpath('//div[@class="col-sm-6 product_main"]').get() is not None:
            title = response.xpath('//div[@class="col-sm-6 product_main"]/h1/text()').get()
            price = response.xpath('//div[@class="col-sm-6 product_main"]/p[@class="price_color"]/text()').get()
            stock = (
                response.xpath('//div[@class="col-sm-6 product_main"]/p[@class="instock availability"]/text()')
                .getall()[-1]
                .strip()
            )
            rating = response.xpath('//div[@class="col-sm-6 product_main"]/p[3]/@class').get()

            # Yielding the extracted data as an Item object.
            item = BookstoscrapeItem()
            item["title"] = title
            item["price"] = price
            item["rating"] = rating
            item["availability"] = stock
            yield item

Have you noticed a change in start_requests()? Why am I generating a request without
a callback? Was I not the one who said, in the last part, that every request must have a corresponding callback?
If you had these questions, I applaud your attention to detail and critical
reasoning. Kudos to you!
Enough beating around the bush; let me get back to answering your questions. I
have not included a callback in the initial request because the rules have the callback
specified in them, along with the URL pattern from which subsequent requests are to be made.

The flow starts with me explicitly generating a request for
http://books.toscrape.com. It is immediately followed by the LinkExtractor extracting
links that match the pattern http://books.toscrape.com/catalogue/. The crawling spider then
generates requests for all the URLs that the LinkExtractor has extracted, with
parse_books as the callback function. These requests are sent to the Scheduler, which in
turn dispatches requests whenever the Engine asks. The usual flow, like before, continues
until no more requests are left at the Scheduler. When you run this spider with a JSON
output, you would get all 1,000 books' details.

scrapy crawl crawl_spider -o crawl_spider_output.json

Sample output is shown below.

[
  {
    "title": "A Light in the Attic",
    "price": "\u00a351.77",
    "rating": "star-rating Three",
    "availability": "In stock (22 available)"
  },
  {
    "title": "Libertarianism for Beginners",
    "price": "\u00a351.33",
    "rating": "star-rating Two",
    "availability": "In stock (19 available)"
  },
  ...
]

# Note: \u00a3 is the Unicode representation of £
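
As an aside of my own (a hedged variation, not from the article), the XPath filtering inside parse_books() could instead be pushed into the rules themselves by tightening the LinkExtractor patterns. The regexes below are assumptions about books.toscrape.com's URL layout, so treat this as a sketch rather than a drop-in replacement:

# Sketch: let the rules decide which pages get parsed (URL patterns are assumptions)
rules = (
    # Follow listing/pagination pages without a callback
    Rule(LinkExtractor(allow=r"catalogue/page-\d+\.html"), follow=True),
    # Treat everything else under catalogue/ (except category listings) as a book page
    Rule(
        LinkExtractor(allow=r"catalogue/", deny=r"catalogue/category/"),
        callback="parse_books",
        follow=False,
    ),
)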

As mentioned before, this is not the only way of extracting the details of all 1,000 books.
A basic spider can also be used to extract the same details. I have included the code for
a basic spider that does exactly that. Create a basic spider using the following command.

scrapy genspider -t basic book_spider books.toscrape.com

The basic spider contains the following code.

import scrapy
from ..items import BookstoscrapeItem


class BookSpiderSpider(scrapy.Spider):
    name = "book_spider"
    allowed_domains = ["books.toscrape.com"]

    def start_requests(self):
        urls = ["http://books.toscrape.com/"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_pages)

    def parse_pages(self, response):
        """
        The purpose of this method is to look for books listings and
        the link for the next page.
        - When it sees a books listing, it generates requests with
          individual book's URL with parse_books() as its callback function.
        - When it sees a next page URL, it generates a request for
          the next page by calling itself as the callback function.
        """
        books = response.xpath("//h3")

        """ Using response.urljoin() to get an individual book page
        for book in books:
            book_url = response.urljoin(book.xpath(".//a/@href").get())
            yield scrapy.Request(url=book_url, callback=self.parse_books)
        """

        # Using response.follow() to get an individual book page
        for book in books:
            yield response.follow(url=book.xpath(".//a/@href").get(), callback=self.parse_books)

        """ Using response.urljoin() to get the next page
        next_page_url = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page_url is not None:
            next_page = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page, callback=self.parse_pages)
        """

        # Using response.follow() to get the next page
        next_page_url = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page_url is not None:
            yield response.follow(url=next_page_url, callback=self.parse_pages)

    def parse_books(self, response):
        """
        Method to extract book details and yield them as an Item object
        """
        title = response.xpath('//div[@class="col-sm-6 product_main"]/h1/text()').get()
        price = response.xpath('//div[@class="col-sm-6 product_main"]/p[@class="price_color"]/text()').get()
        stock = (
            response.xpath('//div[@class="col-sm-6 product_main"]/p[@class="instock availability"]/text()')
            .getall()[-1]
            .strip()
        )
        rating = response.xpath('//div[@class="col-sm-6 product_main"]/p[3]/@class').get()

        item = BookstoscrapeItem()
        item["title"] = title
        item["price"] = price
        item["rating"] = rating
        item["availability"] = stock
        yield item

Have you noticed the same parse_books() method in both spiders? The method of
extracting book details is the same. The only difference is that I have replaced the rules of the
crawling spider with a dedicated (and long) parse_pages() function in the basic spider. I
hope this shows you the distinction between a crawling spider and a basic spider.

. . .

Example 3 — Image scraping

Before starting with this example, let us look at a brief overview of how Scrapy scrapes
and processes files and images. To scrape files or images from webpages, you need to use
the built-in pipelines, specifically FilesPipeline or ImagesPipeline, for the respective
purpose. I will explain the typical workflow when using FilesPipeline.

1. You have to use a Spider to scrape an item and put the URLs of the desired files into a
file_urls field.

2. You then return the item, which goes into the item pipeline.

3. When the item reaches the FilesPipeline, the URLs in file_urls are sent to the
Scheduler to be downloaded by the Downloader. The only difference is that these
file_urls are given higher priority and downloaded before any other requests are
processed.

4. When the files are downloaded, another field, files, is populated with the
results. It comprises the actual download URL, a relative path where the file is
stored, its checksum and the download status. A minimal item sketch follows.
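
To make those two fields concrete, here is a minimal sketch of my own (the item name is hypothetical; the pipeline path and settings are standard Scrapy ones):

# items.py: the two fields FilesPipeline works with
import scrapy

class ReportItem(scrapy.Item):        # hypothetical item for, say, PDF reports
    file_urls = scrapy.Field()        # you fill this with absolute file URLs
    files = scrapy.Field()            # FilesPipeline fills this after downloading

# settings.py: enable the pipeline and tell it where to store the files
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "path/to/store/files"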

FilesPipeline can be used to scrape different types of files (images, PDFs, text, etc.).
ImagesPipeline is specialized for scraping and processing images. Apart from the
functionalities of FilesPipeline, it does the following:

Converts all downloaded images to JPG format and RGB mode

Generates thumbnails

Checks the image width/height to make sure they meet a minimum constraint

Also, the field names are different. Please use image_urls and images in place of file_urls
and files while working with ImagesPipeline. If you wish to know more about file and
image processing, you can always follow this link.
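
The thumbnail generation and minimum-size check mentioned above are driven by settings. A small sketch with illustrative values (the setting names are Scrapy's; the numbers are mine):

# settings.py: optional ImagesPipeline knobs (values are only examples)
IMAGES_THUMBS = {
    "small": (50, 50),    # generates a 50x50 thumbnail alongside each full image
    "big": (270, 270),
}
IMAGES_MIN_HEIGHT = 110   # images smaller than this are dropped
IMAGES_MIN_WIDTH = 110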

Our goal for this example is to scrape the cover images of all the books from the website
books.toscrape.com. I will be repurposing the Crawl Spider from the previous example to
achieve this goal. There is one important step to be done before starting with the code:
you need to set up the ImagesPipeline. To do this, add the following two lines to the
settings.py file in the project folder.

ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "path/to/store/images"

Now you are ready to code. Since I am reusing the crawling spider, there is no
significant difference to the crawling spider's code. The only difference is that you need
to create Item objects containing images and image_urls and yield them from the spider.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import ImagescraperItem
import re


class ImageCrawlSpiderSpider(CrawlSpider):
    name = "image_crawl_spider"
    allowed_domains = ["books.toscrape.com"]
    # start_urls = ["http://books.toscrape.com/"]

    def start_requests(self):
        url = "http://books.toscrape.com/"
        yield scrapy.Request(url=url)

    rules = (Rule(LinkExtractor(allow=r"catalogue/"), callback="parse_image", follow=True),)

    def parse_image(self, response):
        if response.xpath('//div[@class="item active"]/img').get() is not None:
            img = response.xpath('//div[@class="item active"]/img/@src').get()

            """
            Computing the absolute path of the image file.
            "image_urls" require absolute paths, not relative paths
            """
            m = re.match(r"^(?:../../)(.*)$", img).group(1)
            url = "http://books.toscrape.com/"
            img_url = "".join([url, m])

            image = ImagescraperItem()
            image["image_urls"] = [img_url]  # "image_urls" must be a list

            yield image

The items.py file would look like this:

import scrapy


class ImagescraperItem(scrapy.Item):
    images = scrapy.Field()
    image_urls = scrapy.Field()

When you run the spider with an output file, it will crawl all the webpages of
http://books.toscrape.com, scrape the URLs of the books' covers and yield them as
image_urls, which are then sent to the Scheduler, and the workflow continues as
detailed at the beginning of this example.

scrapy crawl image_crawl_spider -o output.json

The downloaded images would be stored at the location specified by IMAGES_STORE and
the output.json will look like this.

[
  {
    "image_urls": [
      "http://books.toscrape.com/media/cache/ee/cf/eecfe998905e455df12064dba399c075.jpg"
    ],
    "images": [
      {
        "url": "http://books.toscrape.com/media/cache/ee/cf/eecfe998905e455df12064dba399c075.jpg",
        "path": "full/59d0249d6ae2eeb367e72b04740583bc70f81558.jpg",
        "checksum": "693caff3d97645e73bd28da8e5974946",
        "status": "downloaded"
      }
    ]
  },
  {
    "image_urls": [
      "http://books.toscrape.com/media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg"
    ],
    "images": [
      {
        "url": "http://books.toscrape.com/media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg",
        "path": "full/1c1a130c161d186db9973e70558b6ec221ce7c4e.jpg",
        "checksum": "e3953238c2ff7ac507a4bed4485c8622",
        "status": "downloaded"
      }
    ]
  },
  ...
]

If you wish to scrape files of other formats, you can use FilesPipeline instead. I
will leave this to your curiosity. You can download these 3 examples from this link.

. . .

Avoiding getting banned

Beginners who are enthusiastic about web scraping might go overboard and scrape
websites at an excessive rate, which might result in their IP getting banned/blacklisted
by the website. Some websites implement certain measures to prevent bots from
crawling them, with varying degrees of sophistication.

The following are some tips to keep in mind when dealing with these kinds of sites; they
are taken from Scrapy's Common Practices:

Rotate your user agent from a pool of well-known ones from browsers (google
around to get a list of them).

Disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot
behaviour.

Use download delays (2 or higher). See DOWNLOAD_DELAY setting.

If possible, use Google cache to fetch pages, instead of hitting the sites directly.

Use a pool of rotating IPs. For example, the free Tor project or paid services like
ProxyMesh. An open-source alternative is scrapoxy, a super proxy that you can
attach your own proxies to.

Use a highly distributed downloader that circumvents bans internally, so you can
just focus on parsing clean pages. One example of such a downloader is Crawlera.
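
Several of these tips map directly onto Scrapy settings. A hedged sketch of how they might look in settings.py (the setting names are Scrapy's; the values are illustrative, not recommendations from the article):

# settings.py: politeness / anti-ban knobs (illustrative values)
COOKIES_ENABLED = False          # don't let sites track the crawl via cookies
DOWNLOAD_DELAY = 2               # wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay (0.5x to 1.5x) to look less bot-like
AUTOTHROTTLE_ENABLED = True      # adapt the delay to the server's response times
USER_AGENT = "my-crawler (+https://example.com/about)"  # or rotate UAs via a middleware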

Closing remarks
As my goal is to make you work confidently with Scrapy after reading this tutorial, I have
restrained myself from diving into the various intricate aspects of Scrapy. But I hope that I
have introduced you to the concept and practice of working with Scrapy, with a clear
distinction between basic and crawling spiders. If you are interested in swimming to the
deeper end of this pool, feel free to take the guidance of the official Scrapy documentation,
which can be reached by clicking here.

In the next part of this web scraping series, we shall be looking at Selenium.

Till then, good luck. Stay safe and happy learning!
