SDS WebScraping Bonus Scrapy Vs BeautifulSoup PDF

Scrapy and Beautiful Soup are Python libraries for web scraping. Scrapy is recommended for more complex projects that require features like spiders, items, and pipelines, while Beautiful Soup can scrape websites with fewer lines of code and is simpler to use. However, the libraries can also be used together, with Beautiful Soup parsing HTML responses extracted by Scrapy. Selenium can also be integrated with either library by automating a browser within a Scrapy spider. Overall, Scrapy is best for more advanced, large-scale scraping, while Beautiful Soup keeps things simpler for straightforward tasks.

WEBSCRAPING:

SCRAPY VS
BEAUTIFUL SOUP
Advantages of Scrapy vs BeautifulSoup - How to Choose?

Let’s examine the following scenario. You are minding your own
business, programming away (looking at programming memes), when
your boss walks in and assigns you a new project. This project requires
you to scrape a website, and your boss wants a plan presented by the
end of the day. Now, I think we can consider this a “good” problem.
How do we choose the right library for web scraping? The architecture
is a key part of the project plan you have to deliver.

We may be more experienced or familiar with some libraries, or we
might never have heard of Scrapy, BeautifulSoup or Selenium. This
article is intended to help you choose the right one, or to highlight the
benefits you can take advantage of in your next scraping project and
include in the plans for your boss.

Let’s examine our options, keeping them to (in my opinion) the main
web scraping libraries we can use in a Python setup. First up is
Scrapy, because it’s considered more of a crawler; by this, I mean that
it has more built in. I would recommend Scrapy if you are looking to
build a more complex project, or something geared toward deployment
in a production-like environment, due to its support for spiders, items,
pipelines and more.
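To make “items and pipelines” more concrete, here is a minimal sketch of a Scrapy-style item pipeline. In a real project the item would typically be a scrapy.Item (which behaves like a dict), so a plain dict stands in for it here, and the field name is a hypothetical one chosen for illustration:

```python
# A minimal sketch of a Scrapy-style item pipeline. A plain dict stands
# in for a scrapy.Item here, and the "title" field name is hypothetical.
class CleanTextPipeline:
    def process_item(self, item, spider):
        # Scrapy calls process_item for every item a spider yields;
        # this one just normalizes whitespace in the "title" field.
        item["title"] = item["title"].strip()
        return item

# Usage sketch: Scrapy would call this for us once the pipeline is
# registered in ITEM_PIPELINES; here we call it by hand.
pipeline = CleanTextPipeline()
cleaned = pipeline.process_item({"title": "  Example Domain  "}, spider=None)
print(cleaned["title"])  # Example Domain
```

Chaining several small pipelines like this (cleaning, validation, storage) is the pattern that makes Scrapy attractive for production-like projects.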

BeautifulSoup, on the other hand, can be simpler and might require
fewer lines of code to scrape. Scrapy does have the advantage of
launching a project setup with a few commands, along with its built-in
functionality and email and cloud options, but BeautifulSoup makes it
very straightforward to write minimal code for a project. This,
combined with requests, makes it useful for general or straightforward
scraping tasks.
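As a sketch of how little code a straightforward BeautifulSoup task needs, here is a parse of a hardcoded HTML snippet. In practice the HTML would come from requests.get(url).text; the page content below is made up for illustration:

```python
from bs4 import BeautifulSoup

# In a real script this HTML would come from requests.get(url).text;
# a hardcoded snippet keeps the sketch self-contained.
html = """
<html><body>
  <h1>Example Domain</h1>
  <a href="/more">More information</a>
</body></html>
"""

# html.parser is the stdlib parser; pass 'lxml' instead for speed.
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.string)   # Example Domain
print(soup.a["href"])   # /more
```

That is essentially the whole program: fetch, parse, pick out tags.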

Moreover, BeautifulSoup’s support for parsers such as lxml makes it
highly accurate when examining web scraping results.

Depending on your level of fluency in Python, this might also help you
narrow down your choice. Each library has useful documentation to
help you examine scraping setups or architecture. But I think
BeautifulSoup is more geared toward simpler Python applications.

CAN YOU USE SCRAPY AND BEAUTIFULSOUP TOGETHER?

Yes, you can. “As mentioned here, BeautifulSoup can be used for
parsing HTML responses in Scrapy callbacks. You just have to feed the
response’s body into a BeautifulSoup object and extract whatever data
you need from it.”
Here’s an example spider using BeautifulSoup API, with lxml as the
HTML parser:

from bs4 import BeautifulSoup
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # use lxml to get decent HTML parsing speed
        soup = BeautifulSoup(response.text, 'lxml')
        yield {
            "url": response.url,
            "title": soup.h1.string
        }

https://docs.scrapy.org/en/latest/faq.html

Also, we have mentioned that you can use BeautifulSoup and
Selenium to pass login information. It’s also possible to utilize Scrapy
for this:

Login with Scrapy:

https://docs.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login

import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
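The authentication_failed helper is left as a TODO in the Scrapy docs. One way to fill it in, sketched below, is to look for a site-specific failure marker in the response body; the marker string here is hypothetical and depends entirely on the target site:

```python
def authentication_failed(response):
    # Hypothetical marker: many login forms echo an error banner on
    # failure. The exact bytes to look for depend on the target site.
    return b"Invalid username or password" in response.body
```

Checking for a failure marker (rather than a success one) is the safer default, since a redirect or layout change on success is common.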


HOW ABOUT INTEGRATING SELENIUM?

When looking to use Selenium for automation, both libraries can be


used. BeautifulSoup due to it’s setup falls into the easier category, but
Scrapy isn’t much more of a challenge to feature Selenium setups
within a Spider.

In the course we saw the use of Selenium with BeautifulSoup. With
Scrapy, it would look something like the following snippet, with
Selenium built into the parse:

import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


class ExampleSpider(scrapy.Spider):
    name = "Test Spider"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/thispagedemo"]

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)

        # keep clicking the close button until it no longer appears
        while True:
            try:
                close_x = self.driver.find_element(By.XPATH, 'xpath here')
                close_x.click()
            except NoSuchElementException:
                break

        self.driver.quit()

By examining this snippet we can see that adding Selenium
functionality to a spider doesn’t require much more work; we just want
to feature it when we are parsing our response.
LET’S KEEP IT SIMPLE

Overall, I think we can simplify the choice:

Scrapy - when you are looking at more complex, large-scale, or detailed
projects that can take advantage of the built-in functionality of the
crawling framework.

BeautifulSoup - to keep things simple and write fewer lines of code
with an easier, more straightforward scraping approach.

With this information in hand, you will be able to decide on the
architecture that will put your project in a position to succeed.
