Web Scraping With Scrapy: Practical Understanding
Karthikeyan P
Jul 31 · 11 min read
With all the theoretical aspects of using Scrapy dealt with in part 1, it is now time for some practical examples. I shall put these theoretical aspects into three examples of increasing complexity.
You can download these examples from my GitHub page. This is the second part of a 4-part tutorial series on web scraping using Scrapy and Selenium. The other parts of the series can be found at the links below.
Important note:
Before you try to scrape any website, please go through its robots.txt file. It can be accessed by appending /robots.txt to the domain, e.g. www.google.com/robots.txt. There you will see which pages of Google's website are allowed and disallowed for crawling. You may access only those pages that fall under
User-agent: * and that are listed under Allow:.
. . .
Weather.com specifies a crawl delay of no less than 10 seconds, i.e. you have to wait at least 10 seconds before requesting another URL from weather.com. This can be found in the site's robots.txt:
User-agent: *
# Crawl-delay: 10
I have created a new Scrapy project using the scrapy startproject command and a basic spider using the scrapy genspider command.
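Assuming the project name weather and the spider name weather_spider used later in this example, the commands would have looked roughly like this:

scrapy startproject weather
cd weather
scrapy genspider weather_spider weather.com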
The first task when starting to code is to adhere to the site's policy. To respect weather.com's crawl delay, we need to add the following line to our Scrapy project's settings.py file.
DOWNLOAD_DELAY = 10
This line makes the spiders in our project wait 10 seconds before making a new URL request. We can now start coding our spider.
As shown earlier, the template code is generated. I have made some modifications to the
code.
import scrapy
import re
from ..items import WeatherItem


class WeatherSpiderSpider(scrapy.Spider):
    name = "weather_spider"
    allowed_domains = ["weather.com"]

    def start_requests(self):
        # Weather.com URL for Chennai's weather
        urls = [
            "https://weather.com/en-IN/weather/today/l/bf01d09009561812f3f95abece23d16e123d8c08fd0b8ec7ffc9215c0154913c"
        ]
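        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_url)

    def parse_url(self, response):
        # Extract the city, temperature, air quality and condition using XPath.
        # NOTE: the selectors below are illustrative placeholders, not
        # weather.com's actual markup; check them against the live page.
        city = response.xpath('//h1/text()').get()
        temp = response.xpath('//span[@class="temperature"]/text()').get()
        air_quality = response.xpath('//span[@class="air-quality"]/text()').get()
        cond = response.xpath('//div[@class="condition"]/text()').get()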
        # Yielding the extracted data as an Item object. You may also extract it as a dictionary.
        item = WeatherItem()
        item["city"] = city
        item["temp"] = temp
        item["air_quality"] = air_quality
        item["cond"] = cond
        yield item
I think the code for this example is self-explanatory. I will, however, explain the flow. I
hope you can remember the overall flow diagram of Scrapy from the last part. I wish to
be in control of making requests, so I use start_requests() instead of start_urls . Inside start_requests(), the URL of Chennai's weather page is specified. If you wish to change it to your preferred city or add more cities, feel free to do so. For every URL in the list, a request is generated and yielded. All of these requests reach the Scheduler, which dispatches them whenever the Engine asks for a request.
After the webpage corresponding to the request is downloaded by the Downloader, the
response is sent back to the engine which directs it to the respective spider. In this case,
WeatherSpider receives the response and calls the callback function parse_url() . Inside
this function, I have used XPath to extract the required data from the response.
You may have followed everything up to this point, but the next part of the code might be new to you, since it has not yet been explained. I have made use of Scrapy Items. These are Python objects
that define key-value pairs. You can refer to this link to explore more about Items. If you
do not wish to make use of Items, you can create a dictionary and yield it instead.
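For instance, a minimal sketch of the dictionary alternative, reusing the same variables extracted in parse_url():

yield {
    "city": city,
    "temp": temp,
    "air_quality": air_quality,
    "cond": cond,
}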
A question may arise: where do we define these so-called Items? Allow me to refresh your memory. While creating a new project, we saw some files being created by Scrapy.
Remember?
weather/
├── scrapy.cfg
└── weather
├── __init__.py
├── items.py
├── middlewares.py
├── pipelines.py
├── __pycache__
├── settings.py
└── spiders
├── WeatherSpider.py
├── __init__.py
└── __pycache__
If you look patiently along this tree, you may notice a file named items.py . It is in this file that the items are defined.
import scrapy


class WeatherItem(scrapy.Item):
    city = scrapy.Field()
    temp = scrapy.Field()
    air_quality = scrapy.Field()
    cond = scrapy.Field()
Scrapy would have already created the class; all you need to do is define the fields as key-value pairs. In this example, since we need the city name, temperature, air quality and condition, I have defined four fields. You can define any number of fields, as required by your project.
When you run the project using the following command, a JSON file containing the
scraped items would be created.
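Assuming the spider name above and an output file named output.json, the command would be:

scrapy crawl weather_spider -o output.json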
output.json
------------
[
  {"city": "Chennai, Tamil Nadu", "temp": "31 C", "air_quality": "Good", "cond": "Cloudy"}
]
Hurray! You have successfully executed a simple Scrapy project handling a single request and response.
. . .
For this example, I will be extracting the title of each book, its price, rating and availability. The items.py file would look like this.
import scrapy


class BookstoscrapeItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    availability = scrapy.Field()
Now that everything needed for the project is ready, let us look into crawl_spider.py .
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CrawlSpiderSpider(CrawlSpider):
    name = "crawl_spider"
    allowed_domains = ["books.toscrape.com"]
    # start_urls = ["http://books.toscrape.com/"]  # when trying to use this, comment out start_requests()

    rules = (Rule(LinkExtractor(allow=r"catalogue/"),
                  callback="parse_books", follow=True),)

    def start_requests(self):
        url = "http://books.toscrape.com/"
        yield scrapy.Request(url)

    # parse_books() — shown later with the basic spider — is defined here as well.
You may be wondering why the initial request in start_requests() does not have a callback. Was I not the one who said, in the last part, that every request must have a corresponding callback? If you had these questions, I applaud your attention to detail and critical reasoning. Kudos to you!
Enough beating around the bush; let me get back to answering your question. I have not included a callback in the initial request because the rules already specify the callback, along with the link pattern from which subsequent requests are to be made.
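Running the crawl spider with an output file (again assuming the file name output.json) would look like the following, and it produces entries like those below:

scrapy crawl crawl_spider -o output.json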
[
  {
    "title": "A Light in the Attic",
    "price": "\u00a351.77",
    "rating": "star-rating Three",
    "availability": "In stock (22 available)"
  },
  {
    "title": "Libertarianism for Beginners",
    "price": "\u00a351.33",
    "rating": "star-rating Two",
    "availability": "In stock (19 available)"
  },
  ...
]
As mentioned before, this is not the only way of extracting the details of all 1000 books. A basic spider can also be used to extract the same details. I have included the code for a basic spider that does the same. Create a basic spider using the following command.
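Since the spider below is named book_spider and allows books.toscrape.com, the command would presumably be:

scrapy genspider book_spider books.toscrape.com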
import scrapy
from ..items import BookstoscrapeItem


class BookSpiderSpider(scrapy.Spider):
    name = "book_spider"
    allowed_domains = ["books.toscrape.com"]

    def start_requests(self):
        urls = ["http://books.toscrape.com/"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_pages)

    def parse_pages(self, response):
        # Follow every book link on the listing page, then the "next" page link.
        # (A minimal sketch — the author's original parse_pages() was longer.)
        books = response.xpath("//h3")
        for book in books:
            yield response.follow(book.xpath(".//a/@href").get(),
                                  callback=self.parse_books)
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse_pages)

    def parse_books(self, response):
        """
        Extract the title, price, rating and availability of a book.
        """
        title = response.xpath('//div[@class="col-sm-6 product_main"]/h1/text()').get()
        price = response.xpath('//div[@class="col-sm-6 product_main"]/p[@class="price_color"]/text()').get()
        stock = (
            response.xpath('//div[@class="col-sm-6 product_main"]/p[@class="instock availability"]/text()')
            .getall()[-1]
            .strip()
        )
        rating = response.xpath('//div[@class="col-sm-6 product_main"]/p[3]/@class').get()
        item = BookstoscrapeItem()
        item["title"] = title
        item["price"] = price
        item["rating"] = rating
        item["availability"] = stock
        yield item
Have you noticed the same parse_books() method in both spiders? The method of extracting book details is identical. The only difference is that the rules of the crawl spider are replaced by a dedicated (and longer) parse_pages() function in the basic spider. I hope this shows you the distinction between a crawl spider and a basic spider.
. . .
1. You have to use a Spider to scrape an item and put the URLs of the desired files into a file_urls field.
2. You then return the item, which goes into the item pipeline.
3. When the item reaches the FilesPipeline , the URLs in the file_urls are sent to the
Scheduler to be downloaded by the Downloader. The only difference is that these
file_urls are given higher priority and downloaded before processing any other
requests.
4. When the files have been downloaded, another field, files , will be populated with the results. It will comprise the actual download URL, a relative path where the file is stored, its checksum and the download status (a minimal settings and items sketch follows this list).
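As referenced in the list above, here is a minimal sketch of what wiring up FilesPipeline might look like; the item class name FileDownloadItem and the storage path are placeholders of mine, not from this tutorial.

# settings.py
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "path/to/store/files"

# items.py
import scrapy

class FileDownloadItem(scrapy.Item):
    file_urls = scrapy.Field()  # filled by your spider with the URLs to download
    files = scrapy.Field()      # filled by FilesPipeline with the download results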
FilesPipeline can be used to scrape different types of files (images, pdfs, texts, etc.).
ImagesPipeline is specialized for scraping and processing images. Apart from the
functionalities of FilesPipeline , it does the following:
Converts all downloaded images to a common format (JPG) and mode (RGB)
Generates thumbnails
Checks the images' width/height to make sure they meet a minimum constraint
Also, the field names are different. Please use image_urls and images in place of file_urls and files while working with ImagesPipeline . If you wish to know more about the Files and Images Pipelines, the official Scrapy documentation covers them in detail.
Our goal for this example is to scrape the cover images of all the books from the website books.toscrape.com. I will be repurposing the crawl spider from the previous example to achieve this goal. There is one important step to be done before starting with the code: you need to set up the ImagesPipeline . To do this, add the following two lines to the project's settings.py file.
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "path/to/store/images"
Now you are ready to code. Since I am reusing the crawl spider, there is no significant difference in the spider's code. The only difference is that you need to create an Item with image_urls and images fields and yield it from the spider.
import re
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import ImagescraperItem


class ImageCrawlSpiderSpider(CrawlSpider):
    name = "image_crawl_spider"
    allowed_domains = ["books.toscrape.com"]
    # start_urls = ["http://books.toscrape.com/"]

    def start_requests(self):
        url = "http://books.toscrape.com/"
        yield scrapy.Request(url=url)

    rules = (Rule(LinkExtractor(allow=r"catalogue/"),
                  callback="parse_image", follow=True),)
    def parse_image(self, response):
        # Yield an item only if the page has a cover image in its carousel.
        if response.xpath('//div[@class="item active"]/img/@src').get() is not None:
            img = response.xpath('//div[@class="item active"]/img/@src').get()
            """
            Computing the absolute path of the image file.
            "image_urls" requires absolute paths, not relative paths.
            """
            m = re.match(r"^(?:../../)(.*)$", img).group(1)
            url = "http://books.toscrape.com/"
            img_url = "".join([url, m])
            image = ImagescraperItem()
            image["image_urls"] = [img_url]  # "image_urls" must be a list
            yield image
import scrapy


class ImagescraperItem(scrapy.Item):
    images = scrapy.Field()
    image_urls = scrapy.Field()
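Assuming an output file named output.json (matching the listing below), a typical run would be:

scrapy crawl image_crawl_spider -o output.json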
When you run the spider with an output file, it crawls all the webpages of http://books.toscrape.com, scrapes the URLs of the books' covers and yields them as image_urls , which are then sent to the Scheduler; the workflow continues as described earlier.
The downloaded images will be stored at the location specified by IMAGES_STORE , and output.json will look like this.
[
  {
    "image_urls": [
      "http://books.toscrape.com/media/cache/ee/cf/eecfe998905e455df12064dba399c075.jpg"
    ],
    "images": [
      {
        "url": "http://books.toscrape.com/media/cache/ee/cf/eecfe998905e455df12064dba399c075.jpg",
        "path": "full/59d0249d6ae2eeb367e72b04740583bc70f81558.jpg",
        "checksum": "693caff3d97645e73bd28da8e5974946",
        "status": "downloaded"
      }
    ]
  },
  {
    "image_urls": [
      "http://books.toscrape.com/media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg"
    ],
    "images": [
      {
        "url": "http://books.toscrape.com/media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg",
        "path": "full/1c1a130c161d186db9973e70558b6ec221ce7c4e.jpg",
        "checksum": "e3953238c2ff7ac507a4bed4485c8622",
        "status": "downloaded"
      }
    ]
  },
  ...
]
If you wish to scrape files of other formats, you can use FilesPipeline instead; I will leave that to your curiosity. You can download these 3 examples from this link.
. . .
The following are some tips to keep in mind when dealing with these kinds of sites; they are taken from the Scrapy Common Practices documentation:
Rotate your user agent from a pool of well-known ones from browsers (google
around to get a list of them).
Disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour (a small settings sketch follows this list).
If possible, use Google cache to fetch pages, instead of hitting the sites directly.
Use a pool of rotating IPs. For example, the free Tor project or paid services like
ProxyMesh. An open-source alternative is scrapoxy, a super proxy that you can
attach your own proxies to.
Use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such a downloader is Crawlera.
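As a rough illustration of the settings-level tips above (the values are examples, not prescriptions):

# settings.py
COOKIES_ENABLED = False  # some sites use cookies to spot bot behaviour
DOWNLOAD_DELAY = 10      # slow down and respect any crawl delay in robots.txt
# User agent rotation and rotating proxies are usually added through downloader
# middlewares rather than a single built-in setting.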
Closing remarks
As my goal is to make you work confidently with Scrapy after reading this tutorial, I have restrained myself from diving into the various intricate aspects of Scrapy. But I hope I have introduced you to the concept and practice of working with Scrapy, with a clear distinction between basic and crawl spiders. If you are interested in swimming to the deeper end of this pool, feel free to take the guidance of the official Scrapy documentation, which can be reached by clicking here.
In the next part of this web scraping series, we shall be looking at Selenium.