SCRAPY VS BEAUTIFUL SOUP
Advantages of Scrapy vs BeautifulSoup - How To Choose?
Let’s examine the following scenario. You are there minding your own
business, programming away (looking at programming memes), when
your boss walks in and assigns you a new project. This project requires
you to scrape a website, and your boss wants a plan presented by the
end of the day. Now, I think we can actually consider this a “good”
problem. How do you choose a specific library for web scraping? The
architecture is a key part of the project plan you’ll be delivering.
Let’s examine our options, keeping them to (in my opinion) the main
web scraping libraries we can use in a Python setup. First up is Scrapy,
because it’s considered more of a crawler; by this, I mean that it has
more built in. I would recommend Scrapy if you are looking to build a
more complex project, or something geared toward deployment in a
production-like environment, thanks to its spiders, items, pipelines,
and more.
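To make those terms concrete, here is a minimal sketch of an item and a pipeline (the QuoteItem fields and the CleanTextPipeline name are hypothetical, purely for illustration): items declare the fields you want to collect, and every item a spider yields passes through the enabled pipelines.

import scrapy

class QuoteItem(scrapy.Item):
    # Hypothetical fields for a quote-collecting project
    text = scrapy.Field()
    author = scrapy.Field()

class CleanTextPipeline:
    # Scrapy calls process_item() for every item a spider yields
    def process_item(self, item, spider):
        item['text'] = item['text'].strip()
        return item

A pipeline like this gets switched on via the ITEM_PIPELINES setting in the project’s settings.py.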
Depending on your level of fluency in Python, this might also help you
narrow down your choice. Each library has useful documentation to
help you examine scraping setups and architecture, but I think
BeautifulSoup is geared more toward simpler Python applications.
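For contrast, a complete BeautifulSoup scraper can be a short standalone script. Here’s a minimal sketch, assuming the requests library is installed and using example.com as a placeholder URL:

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse it, with lxml as the HTML parser
response = requests.get('http://www.example.com/')
soup = BeautifulSoup(response.text, 'lxml')

# Extract the page title and every link target
print(soup.title.string)
for link in soup.find_all('a'):
    print(link.get('href'))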
So, can you use Scrapy and BeautifulSoup together? Yes, you can. As
mentioned in the Scrapy FAQ, “BeautifulSoup can be used for parsing
HTML responses in Scrapy callbacks. You just have to feed the
response’s body into a BeautifulSoup object and extract whatever data
you need from it.”
Here’s an example spider using the BeautifulSoup API, with lxml as the
HTML parser:
from bs4 import BeautifulSoup
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ('http://www.example.com/',)

    def parse(self, response):
        # Use lxml for decent HTML parsing speed
        soup = BeautifulSoup(response.text, 'lxml')
        yield {'url': response.url, 'title': soup.h1.string}
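If you save the spider above in a file such as example.py, a quick way to try it without creating a full Scrapy project is the runspider command:

scrapy runspider example.py -o items.json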
Source: https://docs.scrapy.org/en/latest/faq.html
Scrapy can also simulate a user login by pre-filling a login form with
FormRequest.from_response(), as described at
https://docs.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login:
import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it
    # failed or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response, formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login)

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error('Login failed')
You can even drive a browser from inside a spider (for example, with a
Selenium webdriver attached to the spider as self.driver) to dismiss a
popup before scraping. A rough sketch, with the XPath left as a
placeholder:

from selenium.common.exceptions import NoSuchElementException

class ExampleSpider(scrapy.Spider):
    def close_popups(self):
        # Assumes a Selenium webdriver was attached as self.driver.
        # Keep clicking the popup's close button until it no longer exists.
        while True:
            try:
                close_x = self.driver.find_element_by_xpath('xpath here')
                close_x.click()
            except NoSuchElementException:
                break
        self.driver.quit()
To recap:
Scrapy - when you are looking at more complex, large-scale, or detailed
projects that can take advantage of the built-in functionality of the
crawling framework.
BeautifulSoup - when a simpler, lightweight parser is all a smaller
Python application needs.