SDS WebScraping Bonus Scrapy Vs BeautifulSoup PDF

Scrapy and Beautiful Soup are Python libraries for web scraping. Scrapy is recommended for more complex projects that require features like spiders, items, and pipelines, while Beautiful Soup can scrape websites with fewer lines of code and is simpler to use. However, the libraries can also be used together, with Beautiful Soup parsing HTML responses extracted by Scrapy. Selenium can also be integrated with either library by automating a browser within a Scrapy spider. Overall, Scrapy is best for more advanced, large-scale scraping, while Beautiful Soup keeps things simpler for straightforward tasks.

WEBSCRAPING:

SCRAPY VS
BEAUTIFUL SOUP
Advantages of Scrapy vs BeautifulSoup - How to Choose?

Let’s examine the following scenario. You are minding your own
business, programming away (looking at programming memes), when
your boss walks in and assigns you a new project. This project requires
you to scrape a website, and your boss wants a plan presented by the
end of the day. Now, I think we can consider this a “good” problem.
How do we choose the right library for web scraping? The architecture
is a key part of the project plan you have to deliver.

We may be more experienced or familiar with some libraries, or we
might never have heard of Scrapy, BeautifulSoup or Selenium. This
article is intended to help you choose the right one, or to highlight the
benefits you can take advantage of in your next scraping project and
include in the plans for your boss.

Let’s examine our options, keeping them to (in my opinion) the main
web scraping libraries we can use in a Python setup. First up is
Scrapy, because it’s considered more of a crawler; by this, I mean that
it has more built in. I would recommend Scrapy if you are looking to
build a more complex project, or something geared toward deployment
in a production-like environment, due to its support for spiders, items,
pipelines and more.
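To make “items and pipelines” more concrete, here is a minimal sketch of a Scrapy-style item pipeline. In a real project the item would typically be a scrapy.Item (which behaves like a dict), so a plain dict stands in for it here, and the field name is a hypothetical one chosen for illustration:

```python
# A minimal sketch of a Scrapy-style item pipeline. A plain dict stands
# in for a scrapy.Item here, and the "title" field name is hypothetical.
class CleanTextPipeline:
    def process_item(self, item, spider):
        # Scrapy calls process_item for every item a spider yields;
        # this one just normalizes whitespace in the "title" field.
        item["title"] = item["title"].strip()
        return item

# Usage sketch: Scrapy would call this for us once the pipeline is
# registered in ITEM_PIPELINES; here we call it by hand.
pipeline = CleanTextPipeline()
cleaned = pipeline.process_item({"title": "  Example Domain  "}, spider=None)
print(cleaned["title"])  # Example Domain
```

Chaining several small pipelines like this (cleaning, validation, storage) is the pattern that makes Scrapy attractive for production-like projects.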

BeautifulSoup, on the other hand, can be simpler and might require
fewer lines of code to scrape. Scrapy does have the advantage of
launching a project setup with a few commands, along with its built-in
functionality and email and cloud options, but BeautifulSoup makes it
very straightforward to write minimal code for a project. This,
combined with requests, makes it useful for general or straightforward
scraping tasks.
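As a sketch of how little code a straightforward BeautifulSoup task needs, here is a parse of a hardcoded HTML snippet. In practice the HTML would come from requests.get(url).text; the page content below is made up for illustration:

```python
from bs4 import BeautifulSoup

# In a real script this HTML would come from requests.get(url).text;
# a hardcoded snippet keeps the sketch self-contained.
html = """
<html><body>
  <h1>Example Domain</h1>
  <a href="/more">More information</a>
</body></html>
"""

# html.parser is the stdlib parser; pass 'lxml' instead for speed.
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.string)   # Example Domain
print(soup.a["href"])   # /more
```

That is essentially the whole program: fetch, parse, pick out tags.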

Moreover, BeautifulSoup’s support for parsers such as lxml makes it
highly accurate when examining web scraping results.

Depending on your level of fluency in Python, this might also help you
narrow down your choice. Each library has useful documentation to
help you examine scraping setups or architecture. But I think
BeautifulSoup is more geared toward simpler Python applications.

CAN YOU USE SCRAPY AND BEAUTIFULSOUP TOGETHER?

Yes, you can. “As mentioned here, BeautifulSoup can be used for
parsing HTML responses in Scrapy callbacks. You just have to feed the
response’s body into a BeautifulSoup object and extract whatever data
you need from it.”
Here’s an example spider using BeautifulSoup API, with lxml as the
HTML parser:

from bs4 import BeautifulSoup
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # use lxml to get decent HTML parsing speed
        soup = BeautifulSoup(response.text, 'lxml')
        yield {
            "url": response.url,
            "title": soup.h1.string
        }

https://docs.scrapy.org/en/latest/faq.html

Also, we have mentioned that you can use BeautifulSoup and
Selenium to pass login information. It’s also possible to utilize Scrapy
for this:

Login with Scrapy:

https://docs.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login

import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
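The authentication_failed helper is left as a TODO in the Scrapy docs. One way to fill it in, sketched below, is to look for a site-specific failure marker in the response body; the marker string here is hypothetical and depends entirely on the target site:

```python
def authentication_failed(response):
    # Hypothetical marker: many login forms echo an error banner on
    # failure. The exact bytes to look for depend on the target site.
    return b"Invalid username or password" in response.body
```

Checking for a failure marker (rather than a success one) is the safer default, since a redirect or layout change on success is common.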


HOW ABOUT INTEGRATING SELENIUM?

When looking to use Selenium for automation, both libraries can be


used. BeautifulSoup due to it’s setup falls into the easier category, but
Scrapy isn’t much more of a challenge to feature Selenium setups
within a Spider.

In the course we saw the use of Selenium with BeautifulSoup. With
Scrapy, it would look something like the following snippet, with
Selenium built into the parse:

import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


class ExampleSpider(scrapy.Spider):
    name = "Test Spider"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/thispagedemo"]

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)

        # keep clicking the close button until it no longer appears
        while True:
            try:
                close_x = self.driver.find_element(By.XPATH, 'xpath here')
                close_x.click()
            except NoSuchElementException:
                break

        self.driver.quit()

By examining this snippet we can see that adding Selenium
functionality to a spider doesn’t require much more work; we just want
to feature it when we are parsing our response.
LET’S KEEP IT SIMPLE

Overall, I think we can simplify the choice:

Scrapy - when you are looking at more complex, large-scale, or detailed
projects that can take advantage of the built-in functionality of the
crawling framework.

BeautifulSoup - to keep things simple and write fewer lines of code
with an easier, more straightforward scraping approach.

With this information in hand, you will be able to decide on the
architecture that will put your project in a position to succeed.
