Web Scraping with Python
Collecting Data from the Modern Web

Ryan Mitchell

Boston
Web Scraping with Python
by Ryan Mitchell
Copyright © 2015 Ryan Mitchell. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or [email protected].

Editors: Simon St. Laurent and Allyson MacDonald Indexer: Lucie Haskins
Production Editor: Shiny Kalapurakkel Interior Designer: David Futato
Copyeditor: Jasmine Kwityn Cover Designer: Karen Montgomery
Proofreader: Carla Thornton Illustrator: Rebecca Demarest

June 2015: First Edition

Revision History for the First Edition

2015-06-10: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491910276 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Web Scraping with Python, the cover
image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.

978-1-491-91027-6
[LSI]
Table of Contents

Preface

Part I. Building Scrapers

1. Your First Web Scraper
Connecting
An Introduction to BeautifulSoup
Installing BeautifulSoup
Running BeautifulSoup
Connecting Reliably

2. Advanced HTML Parsing
You Don’t Always Need a Hammer
Another Serving of BeautifulSoup
find() and findAll() with BeautifulSoup
Other BeautifulSoup Objects
Navigating Trees
Regular Expressions
Regular Expressions and BeautifulSoup
Accessing Attributes
Lambda Expressions
Beyond BeautifulSoup

3. Starting to Crawl
Traversing a Single Domain
Crawling an Entire Site
Collecting Data Across an Entire Site
Crawling Across the Internet
Crawling with Scrapy

4. Using APIs
How APIs Work
Common Conventions
Methods
Authentication
Responses
API Calls
Echo Nest
A Few Examples
Twitter
Getting Started
A Few Examples
Google APIs
Getting Started
A Few Examples
Parsing JSON
Bringing It All Back Home
More About APIs

5. Storing Data
Media Files
Storing Data to CSV
MySQL
Installing MySQL
Some Basic Commands
Integrating with Python
Database Techniques and Good Practice
“Six Degrees” in MySQL
Email

6. Reading Documents
Document Encoding
Text
Text Encoding and the Global Internet
CSV
Reading CSV Files
PDF
Microsoft Word and .docx

Part II. Advanced Scraping

7. Cleaning Your Dirty Data
Cleaning in Code
Data Normalization
Cleaning After the Fact
OpenRefine

8. Reading and Writing Natural Languages
Summarizing Data
Markov Models
Six Degrees of Wikipedia: Conclusion
Natural Language Toolkit
Installation and Setup
Statistical Analysis with NLTK
Lexicographical Analysis with NLTK
Additional Resources

9. Crawling Through Forms and Logins
Python Requests Library
Submitting a Basic Form
Radio Buttons, Checkboxes, and Other Inputs
Submitting Files and Images
Handling Logins and Cookies
HTTP Basic Access Authentication
Other Form Problems

10. Scraping JavaScript
A Brief Introduction to JavaScript
Common JavaScript Libraries
Ajax and Dynamic HTML
Executing JavaScript in Python with Selenium
Handling Redirects

11. Image Processing and Text Recognition
Overview of Libraries
Pillow
Tesseract
NumPy
Processing Well-Formatted Text
Scraping Text from Images on Websites
Reading CAPTCHAs and Training Tesseract
Training Tesseract
Retrieving CAPTCHAs and Submitting Solutions

12. Avoiding Scraping Traps
A Note on Ethics
Looking Like a Human
Adjust Your Headers
Handling Cookies
Timing Is Everything
Common Form Security Features
Hidden Input Field Values
Avoiding Honeypots
The Human Checklist

13. Testing Your Website with Scrapers
An Introduction to Testing
What Are Unit Tests?
Python unittest
Testing Wikipedia
Testing with Selenium
Interacting with the Site
Unittest or Selenium?

14. Scraping Remotely
Why Use Remote Servers?
Avoiding IP Address Blocking
Portability and Extensibility
Tor
PySocks
Remote Hosting
Running from a Website Hosting Account
Running from the Cloud
Additional Resources
Moving Forward

A. Python at a Glance

B. The Internet at a Glance

C. The Legalities and Ethics of Web Scraping

Index

Preface

To those who have not developed the skill, computer programming can seem like a
kind of magic. If programming is magic, then web scraping is wizardry; that is, the
application of magic for particularly impressive and useful—yet surprisingly effortless
—feats.
In fact, in my years as a software engineer, I’ve found that very few programming
practices capture the excitement of both programmers and laymen alike quite like
web scraping. The ability to write a simple bot that collects data and streams it down
a terminal or stores it in a database, while not difficult, never fails to provide a certain
thrill and sense of possibility, no matter how many times you might have done it
before.
It’s unfortunate that when I speak to other programmers about web scraping, there’s a
lot of misunderstanding and confusion about the practice. Some people aren’t sure if
it’s legal (it is), or how to handle the modern Web, with all its JavaScript, multimedia,
and cookies. Some get confused about the distinction between APIs and web scrapers.
This book seeks to put an end to many of these common questions and misconceptions
about web scraping, while providing a comprehensive guide to most common
web-scraping tasks.
Beginning in Chapter 1, I’ll provide code samples periodically to demonstrate concepts.
These code samples are in the public domain, and can be used with or without
attribution (although acknowledgment is always appreciated). All code samples also
will be available on the website for viewing and downloading.

What Is Web Scraping?
The automated gathering of data from the Internet is nearly as old as the Internet
itself. Although web scraping is not a new term, in years past the practice has been
more commonly known as screen scraping, data mining, web harvesting, or similar
variations. General consensus today seems to favor web scraping, so that is the term
I’ll use throughout the book, although I will occasionally refer to the web-scraping
programs themselves as bots.
In theory, web scraping is the practice of gathering data through any means other
than a program interacting with an API (or, obviously, through a human using a web
browser). This is most commonly accomplished by writing an automated program
that queries a web server, requests data (usually in the form of the HTML and other
files that comprise web pages), and then parses that data to extract needed information.
In practice, web scraping encompasses a wide variety of programming techniques
and technologies, such as data analysis and information security. This book will cover
the basics of web scraping and crawling (Part I), and delve into some of the advanced
topics in Part II.

Why Web Scraping?

If the only way you access the Internet is through a browser, you’re missing out on a
huge range of possibilities. Although browsers are handy for executing JavaScript,
displaying images, and arranging objects in a more human-readable format (among
other things), web scrapers are excellent at gathering and processing large amounts of
data (among other things). Rather than viewing one page at a time through the narrow
window of a monitor, you can view databases spanning thousands or even millions
of pages at once.
In addition, web scrapers can go places that traditional search engines cannot. A
Google search for “cheapest flights to Boston” will result in a slew of advertisements
and popular flight search sites. Google only knows what these websites say on their
content pages, not the exact results of various queries entered into a flight search
application. However, a well-developed web scraper can chart the cost of a flight to
Boston over time, across a variety of websites, and tell you the best time to buy your
ticket.
You might be asking: “Isn’t data gathering what APIs are for?” (If you’re unfamiliar
with APIs, see Chapter 4.) Well, APIs can be fantastic, if you find one that suits your
purposes. They can provide a convenient stream of well-formatted data from one
server to another. You can find an API for many different types of data you might
want to use, such as Twitter posts or Wikipedia pages. In general, it is preferable to use
an API (if one exists), rather than build a bot to get the same data. However, there are
several reasons why an API might not exist:

• You are gathering data across a collection of sites that do not have a cohesive API.
• The data you want is a fairly small, finite set that the webmaster did not think
warranted an API.
• The source does not have the infrastructure or technical ability to create an API.
Even when an API does exist, request volume and rate limits, the types of data, or the
format of data that it provides might be insufficient for your purposes.
This is where web scraping steps in. With few exceptions, if you can view it in your
browser, you can access it via a Python script. If you can access it in a script, you can
store it in a database. And if you can store it in a database, you can do virtually anything
with that data.
There are obviously many extremely practical applications of having access to nearly
unlimited data: market forecasting, machine language translation, and even medical
diagnostics have benefited tremendously from the ability to retrieve and analyze data
from news sites, translated texts, and health forums, respectively.
Even in the art world, web scraping has opened up new frontiers for creation. The
2006 project “We Feel Fine” by Jonathan Harris and Sep Kamvar scraped a variety of
English-language blog sites for phrases starting with “I feel” or “I am feeling.” This led
to a popular data visualization describing how the world was feeling day by day and
minute by minute.
Regardless of your field, there is almost always a way web scraping can guide business
practices more effectively, improve productivity, or even branch off into a brand-new
field entirely.

About This Book

This book is designed to serve not only as an introduction to web scraping, but as a
comprehensive guide to scraping almost every type of data from the modern Web.
Although it uses the Python programming language, and covers many Python basics,
it should not be used as an introduction to the language.
If you are not an expert programmer and don’t know any Python at all, this book
might be a bit of a challenge. If, however, you are an experienced programmer, you
should find the material easy to pick up. Appendix A covers installing and working
with Python 3.x, which is used throughout this book. If you have only used Python
2.x, or do not have 3.x installed, you might want to review Appendix A.

If you’re looking for a more comprehensive Python resource, the book Introducing
Python by Bill Lubanovic is a very good, if lengthy, guide. For those with shorter
attention spans, the video series Introduction to Python by Jessica McKellar is an
excellent resource.
Appendix C includes case studies, as well as a breakdown of key issues that might
affect how you can legally run scrapers in the United States and use the data that they
produce.
Technical books are often able to focus on a single language or technology, but web
scraping is a relatively disparate subject, with practices that require the use of databases,
web servers, HTTP, HTML, Internet security, image processing, data science, and
other tools. This book attempts to cover all of these to an extent for the purpose of
gathering data from remote sources across the Internet.
Part I covers the subject of web scraping and web crawling in depth, with a strong
focus on a small handful of libraries used throughout the book. Part I can easily be
used as a comprehensive reference for these libraries and techniques (with certain
exceptions, where additional references will be provided).
Part II covers additional subjects that the reader might find useful when writing web
scrapers. These subjects are, unfortunately, too broad to be neatly wrapped up in a
single chapter. Because of this, frequent references will be made to other resources
for additional information.
This book is structured so that you can easily jump between chapters to find only the
web-scraping technique or information that you are looking for. When a concept or
piece of code builds on one mentioned in a previous chapter, I will explicitly
reference the section in which it was addressed.

Conventions Used in This Book

The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment
variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined
by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at
http://pythonscraping.com/code/.
This book is here to help you get your job done. In general, if example code is offered
with this book, you may use it in your programs and documentation. You do not
need to contact us for permission unless you’re reproducing a significant portion of
the code. For example, writing a program that uses several chunks of code from this
book does not require permission. Selling or distributing a CD-ROM of examples
from O’Reilly books does require permission. Answering a question by citing this
book and quoting example code does not require permission. Incorporating a significant
amount of example code from this book into your product’s documentation does
require permission.
We appreciate, but do not require, attribution. An attribution usually includes the
title, author, publisher, and ISBN. For example: “Web Scraping with Python by Ryan
Mitchell (O’Reilly). Copyright 2015 Ryan Mitchell, 978-1-491-91029-0.”
If you feel your use of code examples falls outside fair use or the permission given
here, feel free to contact us at [email protected].

Safari® Books Online
Safari Books Online is an on-demand digital library that delivers
expert content in both book and video form from the
world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative
professionals use Safari Books Online as their primary resource for research,
problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organizations,
government agencies, and individuals. Subscribers have access to thousands
of books, training videos, and prepublication manuscripts in one fully searchable
database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-
Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco
Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt,
Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett,
Course Technology, and dozens more. For more information about Safari Books
Online, please visit us online.

How to Contact Us
Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://oreil.ly/1ePG2Uj.
To comment or ask technical questions about this book, send email to bookques‐
[email protected].
For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments
Just as some of the best products arise out of a sea of user feedback, this book could
have never existed in any useful form without the help of many collaborators, cheerleaders,
and editors. Thank you to the O’Reilly staff and their amazing support for
this somewhat unconventional subject, to my friends and family who have offered
advice and put up with impromptu readings, and to my coworkers at LinkeDrive who
I now likely owe many hours of work to.
Thank you, in particular, to Allyson MacDonald, Brian Anderson, Miguel Grinberg,
and Eric VanWyk for their feedback, guidance, and occasional tough love. Quite a few
sections and code samples were written as a direct result of their inspirational suggestions.
Thank you to Yale Specht for his limitless patience throughout the past nine months,
providing the initial encouragement to pursue this project, and stylistic feedback during
the writing process. Without him, this book would have been written in half the
time but would not be nearly as useful.
Finally, thanks to Jim Waldo, who really started this whole thing many years ago
when he mailed a Linux box and The Art and Science of C to a young and impressionable
teenager.

PART I
Building Scrapers

This section focuses on the basic mechanics of web scraping: how to use Python to
request information from a web server, how to perform basic handling of the server’s
response, and how to begin interacting with a website in an automated fashion. By
the end, you’ll be cruising around the Internet with ease, building scrapers that can
hop from one domain to another, gather information, and store that information for
later use.
To be honest, web scraping is a fantastic field to get into if you want a huge payout for
relatively little upfront investment. In all likelihood, 90% of web scraping projects
you’ll encounter will draw on techniques used in just the next six chapters. This section
covers what the general (albeit technically savvy) public tends to think of when
they think of “web scrapers”:

• Retrieving HTML data from a domain name
• Parsing that data for target information
• Storing the target information
• Optionally, moving to another page to repeat the process
This will give you a solid foundation before moving on to more complex projects in
Part II. Don’t be fooled into thinking that this first section isn’t as important as some
of the more advanced projects in the second half. You will use nearly all the information
in the first half of this book on a daily basis while writing web scrapers.
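
As a rough preview, the four steps listed above can be sketched with nothing but the standard library. This is only an illustration under stated assumptions, not the approach the following chapters build: the names TitleAndLinks, fetch, and crawl are made up for this sketch, the "storage" is a plain dictionary, the only "target information" is each page's title, and pages are assumed to be UTF-8 encoded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class TitleAndLinks(HTMLParser):
    """Hypothetical helper: collects the <title> text and every <a href>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    # Step 1: retrieve the HTML (assumption: the page is UTF-8 encoded).
    return urlopen(url).read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    stored = {}           # Step 3: an in-memory stand-in for real storage
    queue = [start_url]
    while queue and len(stored) < max_pages:
        url = queue.pop(0)
        if url in stored:
            continue      # already visited
        parser = TitleAndLinks()
        parser.feed(fetch_page(url))         # Step 2: parse for target data
        stored[url] = parser.title.strip()
        # Step 4: optionally move on to the pages this one links to.
        queue.extend(urljoin(url, link) for link in parser.links)
    return stored
```

Passing fetch_page as a parameter is a design choice for this sketch: it lets the loop be exercised against canned HTML without touching the network.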
CHAPTER 1
Your First Web Scraper

Once you start web scraping, you start to appreciate all the little things that browsers
do for us. The Web, without a layer of HTML formatting, CSS styling, JavaScript execution,
and image rendering, can look a little intimidating at first, but in this chapter,
as well as the next one, we’ll cover how to format and interpret data without the help
of a browser.
This chapter will start with the basics of sending a GET request to a web server for a
specific page, reading the HTML output from that page, and doing some simple data
extraction in order to isolate the content that we are looking for.

Connecting
If you haven’t spent much time in networking, or network security, the mechanics of
the Internet might seem a little mysterious. We don’t want to think about what,
exactly, the network is doing every time we open a browser and go to
http://google.com, and, these days, we don’t have to. In fact, I would argue that it’s fantastic
that computer interfaces have advanced to the point where most people who use the
Internet don’t have the faintest idea about how it works.
However, web scraping requires stripping away some of this shroud of interface, not
just at the browser level (how it interprets all of this HTML, CSS, and JavaScript), but
occasionally at the level of the network connection.
To give you some idea of the infrastructure required to get information to your
browser, let’s use the following example. Alice owns a web server. Bob uses a desktop
computer, which is trying to connect to Alice’s server. When one machine wants to
talk to another machine, something like the following exchange takes place:

1. Bob’s computer sends along a stream of 1 and 0 bits, indicated by high and low
voltages on a wire. These bits form some information, containing a header and
body. The header contains an immediate destination of his local router’s MAC
address, with a final destination of Alice’s IP address. The body contains his
request for Alice’s server application.
2. Bob’s local router receives all these 1s and 0s and interprets them as a packet,
from Bob’s own MAC address, and destined for Alice’s IP address. His router
stamps its own IP address on the packet as the “from” IP address, and sends it off
across the Internet.
3. Bob’s packet traverses several intermediary servers, which direct his packet
toward the correct physical/wired path, on to Alice’s server.
4. Alice’s server receives the packet, at her IP address.
5. Alice’s server reads the packet port destination in the header and passes the
packet off to the appropriate application: the web server application. The port is
almost always 80 for web applications served over plain HTTP (HTTPS traffic
typically uses port 443). It can be thought of as something like an “apartment
number” for packet data, where the IP address is the “street address.”
6. The web server application receives a stream of data from the server processor.
This data says something like:
- This is a GET request
- The following file is requested: index.html
7. The web server locates the correct HTML file, bundles it up into a new packet to
send to Bob, and sends it through to its local router, for transport back to Bob’s
machine, through the same process.
And voilà! We have The Internet.
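
To make step 6 of the exchange above a little more concrete, here is a minimal sketch of what "This is a GET request" looks like as actual bytes, and how they could be handed to the operating system's network stack with Python's socket module. The function names build_get_request and send_request are invented for this illustration, and the request is deliberately bare (plain HTTP on port 80, no extra headers); real browsers send considerably more.

```python
import socket


def build_get_request(host, path="/"):
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        "GET {} HTTP/1.1".format(path),  # the method and the requested file
        "Host: {}".format(host),         # which site on that server we want
        "Connection: close",             # ask the server to hang up afterward
        "",                              # a blank line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_request(host, path="/", port=80):
    """Open a TCP connection to host:port, send the request, return the raw reply."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:        # an empty read means the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Calling send_request("pythonscraping.com", "/pages/page1.html") would return the server's status line, headers, and body as one bytes object. Depending on how the server is configured, the reply to a plain HTTP request may be a redirect rather than the page itself.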
So, where in this exchange did the web browser come into play? Absolutely nowhere.
In fact, browsers are a relatively recent invention in the history of the Internet,
considering that Nexus, the first web browser, was released in 1990.
Yes, the web browser is a very useful application for creating these packets of information,
sending them off, and interpreting the data you get back as pretty pictures,
sounds, videos, and text. However, a web browser is just code, and code can be
taken apart, broken into its basic components, rewritten, reused, and made to do
anything we want. A web browser can tell the processor to send some data to the
application that handles your wireless (or wired) interface, but many languages have
libraries that can do that just as well.
Let’s take a look at how this is done in Python:
    from urllib.request import urlopen
    html = urlopen("http://pythonscraping.com/pages/page1.html")
    print(html.read())
You can save this code as scrapetest.py and run it in your terminal using the command:
    $ python scrapetest.py
Note that if you also have Python 2.x installed on your machine, you may need to
explicitly call Python 3.x by running the command this way:
    $ python3 scrapetest.py
This will output the complete HTML code for the page at
http://pythonscraping.com/pages/page1.html. More accurately, this outputs the HTML file page1.html, found in
the directory <web root>/pages, on the server located at the domain name
http://pythonscraping.com.
What’s the difference? Most modern web pages have many resource files associated
with them. These could be image files, JavaScript files, CSS files, or any other content
that the page you are requesting is linked to. When a web browser hits a tag such as
<img src="cuteKitten.jpg">, the browser knows that it needs to make another
request to the server to get the data at the file cuteKitten.jpg in order to fully render
the page for the user. Keep in mind that our Python script doesn’t have the logic to go
back and request multiple files (yet); it can only read the single HTML file that we’ve
requested.
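
To see which extra requests a browser would make for a given page, the following sketch lists the resource files an HTML document refers to, using the standard library's html.parser. It is an assumption-laden illustration: the class ResourceLister and function list_resources are invented names, and only three kinds of tags (img, script, link) are checked, whereas a real browser follows many more.

```python
from html.parser import HTMLParser


class ResourceLister(HTMLParser):
    """Records the extra files a browser would request to render a page."""

    # Tag name -> the attribute that holds the resource's URL.
    RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return  # not a tag that points at another file
        for name, value in attrs:
            if name == wanted and value:
                self.resources.append(value)


def list_resources(html):
    """Return the resource URLs found in an HTML string, in document order."""
    parser = ResourceLister()
    parser.feed(html)
    return parser.resources
```

Each entry in the returned list corresponds to one additional request that our three-line script never makes.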
So how does it do this? Thanks to the plain-English nature of Python, the line
from urllib.request import urlopen
means what it looks like it means: it looks at the Python module request (found
within the urllib library) and imports only the function urlopen.

urllib or urllib2?
If you’ve used the urllib2 library in Python 2.x, you might have
noticed that things have changed somewhat between urllib2 and
urllib. In Python 3.x, urllib2 was renamed urllib and was split into
several submodules: urllib.request, urllib.parse, and
urllib.error. Although function names mostly remain the same, you
might want to note which functions have moved to submodules
when using the new urllib.

urllib is a standard Python library (meaning you don’t have to install anything extra
to run this example) and contains functions for requesting data across the web, han‐
dling cookies, and even changing metadata such as headers and your user agent. We
will be using urllib extensively throughout the book, so we recommend you read the
Python documentation for the library (https://docs.python.org/3/library/urllib.html).
urlopen is used to open a remote object across a network and read it. Because it is a
fairly generic function (it can read HTML files, image files, or any other file stream
with ease), we will be using it quite frequently throughout the book.
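As a small illustration of how the pieces of urllib fit together, the following sketch touches all three submodules without making a network request. The user agent string "example-scraper/0.1" is a made-up value for illustration only.

```python
from urllib.parse import urlparse
from urllib.request import Request
from urllib.error import HTTPError, URLError

url = "http://pythonscraping.com/pages/page1.html"

# urllib.parse splits a URL into its components
parts = urlparse(url)
print(parts.netloc)  # the domain name
print(parts.path)    # the file's location under the web root

# urllib.request can describe a request, including custom headers,
# before anything is sent (pass the Request object to urlopen to send it)
request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
print(request.get_header("User-agent"))

# urllib.error holds the exception classes; HTTPError is a kind of URLError
print(issubclass(HTTPError, URLError))
```

Nothing is downloaded here; the Request object only describes what would be sent.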

An Introduction to BeautifulSoup
“Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!”

The BeautifulSoup library was named after a Lewis Carroll poem of the same name in
Alice’s Adventures in Wonderland. In the story, this poem is sung by a character called
the Mock Turtle (itself a pun on the popular Victorian dish Mock Turtle Soup made
not of turtle but of cow).
Like its Wonderland namesake, BeautifulSoup tries to make sense of the nonsensical;
it helps format and organize the messy web by fixing bad HTML and presenting us
with easily traversable Python objects representing XML structures.

Installing BeautifulSoup
Because the BeautifulSoup library is not a default Python library, it must be installed.
We will be using the BeautifulSoup 4 library (also known as BS4) throughout this
book. The complete instructions for installing BeautifulSoup 4 can be found at
Crummy.com; however, the basic method for Linux is:
$sudo apt-get install python-bs4
and for Macs:
$sudo easy_install pip
This installs the Python package manager pip. Then run the following:
$pip install beautifulsoup4
to install the library.
Again, note that if you have both Python 2.x and 3.x installed on your machine, you
might need to call python3 explicitly:
$python3 myScript.py
Make sure to also use this when installing packages, or the packages might be
installed under Python 2.x, but not Python 3.x:
$sudo python3 setup.py install

If using pip, you can also call pip3 to install the Python 3.x versions of packages:
$pip3 install beautifulsoup4


Installing packages in Windows is nearly identical to the process for the Mac and
Linux. Download the most recent BeautifulSoup 4 release from the download URL
above, navigate to the directory you unzipped it to, and run:
>python setup.py install
And that’s it! BeautifulSoup will now be recognized as a Python library on your
machine. You can test this out by opening a Python terminal and importing it:
$python
>>> from bs4 import BeautifulSoup
The import should complete without errors.

In addition, there is an .exe installer for pip on Windows, so you can easily install and
manage packages:
>pip install beautifulsoup4

Keeping Libraries Straight with Virtual Environments

If you intend to work on multiple Python projects or you need a way to easily bundle
projects with all associated libraries, or you’re worried about potential conflicts
between installed libraries, you can install a Python virtual environment to keep
everything separated and easy to manage.
When you install a Python library without a virtual environment, you are installing
it globally. This usually requires that you be an administrator, or run as root, and that
Python library exists for every user and every project on the machine. Fortunately,
creating a virtual environment is easy:
$ virtualenv scrapingEnv
This creates a new environment called scrapingEnv, which you must activate in order
to use:
$ cd scrapingEnv/
$ source bin/activate
After you have activated the environment, you will see that environment’s name in
your command line prompt, reminding you that you’re currently working with it.
Any libraries you install or scripts you run will be under that virtual environment
only.
Working in the newly-created scrapingEnv environment, I can install and use Beauti‐
fulSoup, for instance:
(scrapingEnv)ryan$ pip install beautifulsoup4
(scrapingEnv)ryan$ python
>>> from bs4 import BeautifulSoup
>>>

I can leave the environment with the deactivate command, after which I can no
longer access any libraries that were installed inside the virtual environment:
(scrapingEnv)ryan$ deactivate
ryan$ python
>>> from bs4 import BeautifulSoup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'bs4'
Keeping all your libraries separated by project also makes it easy to zip up the entire
environment folder and send it to someone else. As long as they have the same ver‐
sion of Python installed on their machine, your code will work from the virtual envi‐
ronment without requiring them to install any libraries themselves.
Although we won’t explicitly instruct you to use a virtual environment in all of this
book’s examples, keep in mind that you can apply a virtual environment at any time
simply by activating it beforehand.

Running BeautifulSoup
The most commonly used object in the BeautifulSoup library is, appropriately, the
BeautifulSoup object. Let’s take a look at it in action, modifying the example found in
the beginning of this chapter:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
The output is:
<h1>An Interesting Title</h1>
As in the example before, we are importing the urlopen function and calling
html.read() in order to get the HTML content of the page. This HTML content is
then transformed into a BeautifulSoup object, with the following structure:

• html → <html><head>...</head><body>...</body></html>
    — head → <head><title>A Useful Page</title></head>
        — title → <title>A Useful Page</title>
    — body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
        — h1 → <h1>An Interesting Title</h1>
        — div → <div>Lorem Ipsum dolor...</div>


Note that the <h1> tag that we extracted from the page was nested two layers deep
into our BeautifulSoup object structure (html → body → h1). However, when we
actually fetched it from the object, we called the h1 tag directly:
bsObj.h1
In fact, any of the following function calls would produce the same output:
bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
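To try this without a network connection, you can hand BeautifulSoup a string instead of a downloaded page. The sketch below assumes a cut-down copy of page1.html (the real page has more text) and names the parser explicitly, which avoids a warning in recent BeautifulSoup releases:

```python
from bs4 import BeautifulSoup

# A shortened stand-in for page1.html, written inline for illustration
page = ("<html><head><title>A Useful Page</title></head>"
        "<body><h1>An Interesting Title</h1>"
        "<div>Lorem ipsum dolor...</div></body></html>")

bsObj = BeautifulSoup(page, "html.parser")

# All four expressions reach the same <h1> tag
for tag in (bsObj.h1, bsObj.html.body.h1, bsObj.body.h1, bsObj.html.h1):
    print(tag)
```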

We hope this small taste of BeautifulSoup has given you an idea of the power and
simplicity of this library. Virtually any information can be extracted from any HTML
(or XML) file, as long as it has some identifying tag surrounding it, or near it. In
Chapter 2, we’ll delve more deeply into some more-complex BeautifulSoup function
calls, as well as take a look at regular expressions and how they can be used with
BeautifulSoup in order to extract information from websites.

Connecting Reliably
The web is messy. Data is poorly formatted, websites go down, and closing tags go
missing. One of the most frustrating experiences in web scraping is to go to sleep
with a scraper running, dreaming of all the data you’ll have in your database the next
day—only to find out that the scraper hit an error on some unexpected data format
and stopped execution shortly after you stopped looking at the screen. In situations
like these, you might be tempted to curse the name of the developer who created the
website (and the oddly formatted data), but the person you should really be kicking is
yourself, for not anticipating the exception in the first place!
Let’s take a look at the first line of our scraper, after the import statements, and figure
out how to handle any exceptions this might throw:
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
There are two main things that can go wrong in this line:

• The page is not found on the server (or there was some error in retrieving it)
• The server is not found
In the first situation, an HTTP error will be returned. This HTTP error may be “404
Page Not Found,” “500 Internal Server Error,” etc. In all of these cases, the urlopen
function will throw the generic exception HTTPError. We can handle this exception
in the following way:

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return null, break, or do some other "Plan B"
else:
    # program continues. Note: If you return or break in the
    # exception catch, you do not need to use the "else" statement
    pass
If an HTTP error code is returned, the program now prints the error, and does not
execute the rest of the program under the else statement.
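If the server is not found at all (if, say, http://www.pythonscraping.com was down, or
the domain name was mistyped), urlopen does not return a None object; it throws a
URLError, which is imported from urllib.error alongside HTTPError. Because HTTPError
is a subclass of URLError, catch the more specific HTTPError first and add a second
except clause for URLError:
try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
except URLError as e:
    print("The server could not be found")
# program continues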
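One way to keep this handling in a single place is a small helper. The sketch below is an assumption rather than code used elsewhere in this book: open_page and its opener parameter are hypothetical names, and opener exists only so that the function can be tried without a network connection. It relies on the fact that urllib signals an unreachable server with a URLError, and that HTTPError is a subclass of URLError and so must be caught first.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def open_page(url, opener=urlopen):
    """Return the response object, or None if the request failed."""
    try:
        return opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        print("The server returned an HTTP error:", e.code)
    except URLError as e:
        # No usable answer at all: server down, bad domain name, and so on
        print("The server could not be reached:", e.reason)
    return None

# Stand-in openers that fail the way urlopen can, so the helper can be
# tried without a network connection
def missing_page(url):
    raise HTTPError(url, 404, "Not Found", None, None)

def missing_server(url):
    raise URLError("name or service not known")

print(open_page("http://www.pythonscraping.com/nope", opener=missing_page))
print(open_page("http://no-such-host.invalid/", opener=missing_server))
```

In real use you would call open_page(url) with the default opener and check the result against None before parsing it.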
Of course, if the page is retrieved successfully from the server, there is still the issue of
the content on the page not quite being what we expected. Every time you access a tag
in a BeautifulSoup object, it’s smart to add a check to make sure the tag actually
exists. If you attempt to access a tag that does not exist, BeautifulSoup will return a
None object. The problem is, attempting to access a tag on a None object itself will
result in an AttributeError being thrown.
The following line (where nonExistentTag is a made-up tag, not the name of a real
BeautifulSoup function):
print(bsObj.nonExistentTag)

returns a None object. This object is perfectly reasonable to handle and check for. The
trouble comes if you don’t check for it, but instead go on and try to call some other
function on the None object, as illustrated in the following:
print(bsObj.nonExistentTag.someTag)
which returns the exception:
AttributeError: 'NoneType' object has no attribute 'someTag'
So how can we guard against these two situations? The easiest way is to explicitly
check for both situations:
try:
    badContent = bsObj.nonExistentTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
This checking and handling of every error does seem laborious at first, but it’s easy to
add a little reorganization to this code to make it less difficult to write (and, more
importantly, much less difficult to read). This code, for example, is our same scraper
written in a slightly different way:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # HTTP error status, or the server could not be reached at all
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

In this example, we’re creating a function getTitle, which returns either the title of
the page, or a None object if there was some problem with retrieving it. Inside
getTitle, we check for an HTTPError, as in the previous example, and also encapsulate two
of the BeautifulSoup lines inside one try statement. An AttributeError is thrown
there if an expected tag is missing (for example, if the page has no body tag,
bsObj.body is None, and accessing bsObj.body.h1 throws an AttributeError). Note that
urlopen never returns None when a server cannot be reached; it throws a URLError,
which must also be caught. We could, in fact,
encompass as many lines as we wanted inside one try statement, or call another
function entirely, which can throw an AttributeError at any point.
When writing scrapers, it’s important to think about the overall pattern of your code
in order to handle exceptions and make it readable at the same time. You’ll also likely
want to heavily reuse code. Having generic functions such as getSiteHTML and
getTitle (complete with thorough exception handling) makes it easy to quickly, and
reliably, scrape the web.

CHAPTER 2
Advanced HTML Parsing

When Michelangelo was asked how he could sculpt a work of art as masterful as his
David, he is famously reported to have said: “It is easy. You just chip away the stone
that doesn’t look like David.”
Although web scraping is unlike marble sculpting in most other respects, we must
take a similar attitude when it comes to extracting the information we’re seeking from
complicated web pages. There are many techniques to chip away the content that
doesn’t look like the content that we’re searching for, until we arrive at the
information we’re seeking. In this chapter, we’ll take a look at parsing complicated HTML pages
in order to extract only the information we’re looking for.

You Don’t Always Need a Hammer

It can be tempting, when faced with a Gordian Knot of tags, to dive right in and use
multiline statements to try to extract your information. However, keep in mind that
layering the techniques used in this section with reckless abandon can lead to code
that is difficult to debug, fragile, or both. Before getting started, let’s take a look at
some of the ways you can avoid altogether the need for advanced HTML parsing!
Let’s say you have some target content. Maybe it’s a name, statistic, or block of text.
Maybe it’s buried 20 tags deep in an HTML mush with no helpful tags or HTML
attributes to be found. Let’s say you dive right in and write something like the follow‐
ing line to attempt extraction:
bsObj.findAll("table")[4].findAll("tr")[2].find("td").findAll("div")[1].find("a")
That doesn’t look so great. In addition to the aesthetics of the line, even the slightest
change to the website by a site administrator might break your web scraper alto‐
gether. So what are your options?

• Look for a “print this page” link, or perhaps a mobile version of the site that has
better-formatted HTML (more on presenting yourself as a mobile device—and
receiving mobile site versions—in Chapter 12).
• Look for the information hidden in a JavaScript file. Remember, you might need
to examine the imported JavaScript files in order to do this. For example, I once
collected street addresses (along with latitude and longitude) off a website in a
neatly formatted array by looking at the JavaScript for the embedded Google Map
that displayed a pinpoint over each address.
• This is more common for page titles, but the information might be available in
the URL of the page itself.
• If the information you are looking for is unique to this website for some reason,
you’re out of luck. If not, try to think of other sources you could get this informa‐
tion from. Is there another website with the same data? Is this website displaying
data that it scraped or aggregated from another website?
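As an illustration of the URL option above, the standard library can often pull the pieces out of a URL without any HTML parsing at all. The address below is hypothetical and invented for this sketch, as is the assumption that the second and third path segments hold an ID and a title:

```python
from urllib.parse import urlparse, parse_qs

# A hypothetical product URL, invented for this example
url = "http://example.com/products/1234/blue-widget?color=blue&size=m"

parts = urlparse(url)

# Split the path into its non-empty segments: products, 1234, blue-widget
segments = [s for s in parts.path.split("/") if s]
product_id = segments[1]
title = segments[2].replace("-", " ")

# The query string becomes a dictionary of lists of values
options = parse_qs(parts.query)

print(product_id, title, options)
```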
Especially when faced with buried or poorly formatted data, it’s important not to just
start digging. Take a deep breath and think of alternatives. If you’re certain no alter‐
natives exist, the rest of this chapter is for you.

Another Serving of BeautifulSoup

In Chapter 1, we took a quick look at installing and running BeautifulSoup, as well as
selecting objects one at a time. In this section, we’ll discuss searching for tags by
attributes, working with lists of tags, and parse tree navigation.
Nearly every website you encounter contains stylesheets. Although you might think
that a layer of styling on websites that is designed specifically for browser and human
interpretation might be a bad thing, the advent of CSS is actually a boon for web scra‐
pers. CSS relies on the differentiation of HTML elements that might otherwise have
the exact same markup in order to style them differently. That is, some tags might
look like this:
<span class="green"></span>
while others look like this:
<span class="red"></span>
Web scrapers can easily separate these two different tags based on their class; for
example, they might use BeautifulSoup to grab all of the red text but none of the
green text. Because CSS relies on these identifying attributes to style sites appropri‐
ately, you are almost guaranteed that these class and ID attributes will be plentiful on
most modern websites.
Let’s create an example web scraper that scrapes the page located at
http://www.pythonscraping.com/pages/warandpeace.html.


On this page, the lines spoken by characters in the story are in red, whereas the
names of characters themselves are in green. You can see the span tags, which refer‐
ence the appropriate CSS classes, in the following sample of the page’s source code:
"<span class="red">Heavens! what a virulent attack!</span>" replied <span class=
"green">the prince</span>, not in the least disconcerted by this reception.
We can grab the entire page and create a BeautifulSoup object with it using a program
similar to the one used in Chapter 1:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")

Using this BeautifulSoup object, we can use the findAll function to extract a Python
list of proper nouns found by selecting only the text within
<span class="green"></span> tags (findAll is an extremely flexible function we’ll be
using a lot later in this book):
nameList = bsObj.findAll("span", {"class":"green"})
for name in nameList:
    print(name.get_text())
When run, it should list all the proper nouns in the text, in the order they appear in
War and Peace. So what’s going on here? Previously, we’ve called bsObj.tagName in
order to get the first occurrence of that tag on the page. Now, we’re calling
bsObj.findAll(tagName, tagAttributes) in order to get a list of all of the tags on
the page, rather than just the first.
After getting a list of names, the program iterates through all names in the list, and
prints name.get_text() in order to separate the content from the tags.
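The difference between a tag and its text is easy to see on a small scale. This sketch runs the same findAll call against the single line of page source quoted earlier, supplied as a string so that no download is needed:

```python
from bs4 import BeautifulSoup

# The sample line from the page source shown earlier, as an inline string
html = ('"<span class="red">Heavens! what a virulent attack!</span>" replied '
        '<span class="green">the prince</span>, not in the least '
        'disconcerted by this reception.')

bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.findAll("span", {"class": "green"})

for name in nameList:
    print(name)             # the whole tag, markup included
    print(name.get_text())  # only the text inside the tag
```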

When to get_text() and When to Preserve Tags

.get_text() strips all tags from the document you are working
with and returns a string containing the text only. For example, if
you are working with a large block of text that contains many
hyperlinks, paragraphs, and other tags, all those will be stripped
away and you’ll be left with a tagless block of text.
Keep in mind that it’s much easier to find what you’re looking for
in a BeautifulSoup object than in a block of text. Call‐
ing .get_text() should always be the last thing you do, immedi‐
ately before you print, store, or manipulate your final data. In
general, you should try to preserve the tag structure of a document
as long as possible.


find() and findAll() with BeautifulSoup
BeautifulSoup’s find() and findAll() are the two functions you will likely use the
most. With them, you can easily filter HTML pages to find lists of desired tags, or a
single tag, based on their various attributes.
The two functions are extremely similar, as evidenced by their definitions in the
BeautifulSoup documentation:
findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
In all likelihood, 95% of the time you will find yourself only needing to use the first
two arguments: tag and attributes. However, let’s take a look at all of the arguments
in greater detail.
The tag argument is one that we’ve seen before—you can pass a string name of a tag
or even a Python list of string tag names. For example, the following will return a list
of all the header tags in a document:1
.findAll(["h1","h2","h3","h4","h5","h6"])

The attributes argument takes a Python dictionary of attributes and matches tags
that contain any one of the values listed for an attribute. For example, the following
function would return both the green and red span tags in the HTML document:
.findAll("span", {"class":["green", "red"]})
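A dictionary literal cannot hold the same key twice (the last value silently wins), so to match several values of one attribute the sketch below passes them as a list. The HTML is a made-up fragment used only to show which tags each call selects:

```python
from bs4 import BeautifulSoup

# A made-up fragment with three headers-or-not and three colored spans
html = ('<h1>Title</h1><h2>Subtitle</h2><p>Text</p>'
        '<span class="green">Anna</span>'
        '<span class="red">Hello</span>'
        '<span class="blue">ignored</span>')
bsObj = BeautifulSoup(html, "html.parser")

# A list of tag names matches any tag whose name is in the list
headers = bsObj.findAll(["h1", "h2", "h3"])

# A list of attribute values matches any tag carrying one of them
spans = bsObj.findAll("span", {"class": ["green", "red"]})

print([tag.name for tag in headers])
print([tag.get_text() for tag in spans])
```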

The recursive argument is a boolean. How deeply into the document do you want to
go? If recursive is set to True, the findAll function looks into children, and
children’s children, for tags that match your parameters. If it is False, it will look only at
the top-level tags in your document. By default, findAll works recursively
(recursive is set to True); it’s generally a good idea to leave this as is, unless you really know
what you need to do and performance is an issue.
The text argument is unusual in that it matches based on the text content of the tags,
rather than properties of the tags themselves. For instance, if we want to find the
number of times “the prince” was surrounded by tags on the example page, we could
replace our .findAll() function in the previous example with the following lines:
nameList = bsObj.findAll(text="the prince")
print(len(nameList))
The output of this is “7.”

1 If you’re looking to get a list of all h<some_level> tags in the document, there are more succinct ways of writ‐
ing this code to accomplish the same thing. We’ll take a look at other ways of approaching these types of prob‐
lems in the section BeautifulSoup and regular expressions.


The limit argument, of course, is only used in the findAll method; find is equiva‐
lent to the same findAll call, with a limit of 1. You might set this if you’re only inter‐
ested in retrieving the first x items from the page. Be aware, however, that this gives
you the first items on the page in the order that they occur, not necessarily the first
ones that you want.
The keyword argument allows you to select tags that contain a particular attribute.
For example:
allText = bsObj.findAll(id="text")
print(allText[0].get_text())
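The text, limit, and keyword arguments can be tried together on a small fragment. The markup below is invented for this sketch. Note also that BeautifulSoup releases from 4.4 onward prefer the name string for the text argument; the older name is used here to match the rest of this chapter.

```python
from bs4 import BeautifulSoup

# An invented fragment: one div with id="text" holding three green spans
html = ('<div id="text"><span class="green">the prince</span> spoke to '
        '<span class="green">Anna</span>; '
        '<span class="green">the prince</span> smiled.</div>')
bsObj = BeautifulSoup(html, "html.parser")

# text: match on the exact text content rather than on the tag
princeList = bsObj.findAll(text="the prince")
print(len(princeList))

# limit: stop after the first match, in document order
firstSpan = bsObj.findAll("span", limit=1)
print(firstSpan[0].get_text())

# keyword: select by attribute, here id="text"
allText = bsObj.findAll(id="text")
print(allText[0].get_text())
```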

A Caveat to the keyword Argument

The keyword argument can be very helpful in some situations.
However, it is technically redundant as a BeautifulSoup feature.
Keep in mind that anything that can be done with keyword can also
be accomplished using techniques we will discuss later in this chap‐
ter (see Regular Expressions and Lambda Expressions).
For instance, the following two lines are identical:
bsObj.findAll(id="text")
bsObj.findAll("", {"id":"text"})
In addition, you might occasionally run into problems using keyword,
most notably when searching for elements by their class
attribute, because class is a protected keyword in Python. That is,
class is a reserved word in Python that cannot be used as a vari‐
able or argument name (no relation to the BeautifulSoup.findAll()
keyword argument, previously discussed).2 For example, if you try
the following call, you’ll get a syntax error due to the nonstandard
use of class:
bsObj.findAll(class="green")
Instead, you can use BeautifulSoup’s somewhat clumsy solution,
which involves adding an underscore:
bsObj.findAll(class_="green")
Alternatively, you can enclose class in quotes:
bsObj.findAll("", {"class":"green"})
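A quick check that the two spellings select the same tags, and that a keyword argument narrows a tag-name search rather than widening it. The fragment is made up for this sketch, and the dictionary is passed through the attrs parameter by name instead of after an empty tag name:

```python
from bs4 import BeautifulSoup

# A made-up fragment: two green tags of different types and one red span
html = ('<span class="green">Anna</span>'
        '<p class="green">Pierre</p>'
        '<span class="red">Hello</span>')
bsObj = BeautifulSoup(html, "html.parser")

# Two ways to ask for every tag whose class is "green"
byKeyword = bsObj.findAll(class_="green")
byDict = bsObj.findAll(attrs={"class": "green"})
print(byKeyword == byDict)

# Tag name and keyword together: must be a span AND have class "green"
greenSpans = bsObj.findAll("span", class_="green")
print([tag.get_text() for tag in greenSpans])
```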

At this point, you might be asking yourself, “But wait, don’t I already know how to get
a list of tags by attribute—by passing attributes to the function in a dictionary list?”

2 The Python Language Reference provides a complete list of protected keywords.


Recall that passing a list of tags to .findAll() via the attributes list acts as an “or”
filter (i.e., it selects a list of all tags that have tag1 or tag2 or tag3...). If you have a
lengthy list of tags, you can end up with a lot of stuff you don’t want. The keyword
argument allows you to add an additional “and” filter to this.

Other BeautifulSoup Objects

So far in the book, you’ve seen two types of objects in the BeautifulSoup library:

BeautifulSoup objects
Seen in previous code examples as bsObj
Tag objects
Retrieved in lists or individually by calling find and findAll on a
BeautifulSoup object, or drilling down, as in:
bsObj.div.h1
However, there are two more objects in the library that, although less commonly
used, are still important to know about:
NavigableString objects
Used to represent text within tags, rather than the tags themselves (some func‐
tions operate on, and produce, NavigableStrings, rather than tag objects).
The Comment object
Used to find HTML comments in comment tags, <!--like this one-->
These four objects are the only objects you will ever encounter (as of the time of this
writing) in the BeautifulSoup library.
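The four object types can be seen side by side in a short sketch. The markup is invented for illustration; the lambda passed as text is one common idiom for picking out comments, not the only one.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# An invented fragment containing a tag, some text, and a comment
html = "<div><h1>An Interesting Title</h1><!--a hidden note--></div>"
bsObj = BeautifulSoup(html, "html.parser")

h1 = bsObj.div.h1          # a Tag, reached by drilling down
title = h1.string          # the NavigableString inside that tag
comment = bsObj.find(text=lambda s: isinstance(s, Comment))

for obj in (bsObj, h1, title, comment):
    print(type(obj).__name__, "->", obj)
```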

Navigating Trees
The findAll function is responsible for finding tags based on their name and
attribute. But what if you need to find a tag based on its location in a document?
That’s where tree navigation comes in handy. In Chapter 1, we looked at navigating a
BeautifulSoup tree in a single direction:
bsObj.tag.subTag.anotherSubTag
Now let’s look at navigating up, across, and diagonally through HTML trees using our
highly questionable online shopping site http://www.pythonscraping.com/pages/page3.html
as an example page for scraping (see Figure 2-1):


Figure 2-1. Screenshot from http://www.pythonscraping.com/pages/page3.html

The HTML for this page, mapped out as a tree (with some tags omitted for brevity),
looks like:

• html
    — body
        — div.wrapper
            — h1
            — div.content
            — table#giftList
                — tr
                    — th
                    — th
                    — th
                    — th
                — tr.gift#gift1
                    — td
                    — td
                        — span.excitingNote
                    — td
                    — td
                        — img
                — ...table rows continue...
            — div.footer
We will use this same HTML structure as an example in the next few sections.

Dealing with children and other descendants

In computer science and some branches of mathematics, you often hear about horri‐
ble things done to children: moving them, storing them, removing them, and even
killing them. Fortunately, in BeautifulSoup, children are treated differently.
In the BeautifulSoup library, as well as many other libraries, there is a distinction
drawn between children and descendants: much like in a human family tree, children
are always exactly one tag below a parent, whereas descendants can be at any level in
the tree below a parent. For example, the tr tags are children of the table tag,
whereas tr, th, td, img, and span are all descendants of the table tag (at least in our
example page). All children are descendants, but not all descendants are children.
In general, BeautifulSoup functions will always deal with the descendants of the cur‐
rent tag selected. For instance, bsObj.body.h1 selects the first h1 tag that is a
descendant of the body tag. It will not find tags located outside of the body.
Similarly, bsObj.div.findAll("img") will find the first div tag in the document,
then retrieve a list of all img tags that are descendants of that div tag.
If you want to find only descendants that are children, you can use the .children attribute:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

for child in bsObj.find("table",{"id":"giftList"}).children:
    print(child)

This code prints out the list of product rows in the giftList table. If you were to
write it using the .descendants attribute instead of the .children attribute, about
two dozen tags would be found within the table and printed, including img tags, span
tags, and individual td tags. It’s definitely important to differentiate between children
and descendants!
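The difference is easiest to see on a table small enough to count by hand. The two-row table below is a made-up stand-in for the gift list, written without whitespace between tags so that only tags (and no stray text nodes) appear among the children:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A made-up, two-row stand-in for the giftList table
html = ('<table id="giftList">'
        '<tr><th>Item</th><th>Cost</th></tr>'
        '<tr class="gift"><td>Vegetable Basket</td><td>$15.00</td></tr>'
        '</table>')
bsObj = BeautifulSoup(html, "html.parser")
table = bsObj.find("table", {"id": "giftList"})

# .children yields only the tags directly below the table
children = [child.name for child in table.children]

# .descendants walks the whole subtree; text nodes are filtered out here
descendants = [d.name for d in table.descendants if isinstance(d, Tag)]

print(children)
print(descendants)
```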


Dealing with siblings
The BeautifulSoup next_siblings attribute makes it trivial to collect data from
tables, especially ones with title rows:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    print(sibling)
The output of this code is to print all rows of products from the product table, except
for the first title row. Why does the title row get skipped? Two reasons: first, objects
cannot be siblings with themselves. Any time you get siblings of an object, the object
itself will not be included in the list. Second, this function calls next siblings only. If
we were to select a row in the middle of the list, for example, and call next_siblings
on it, only the subsequent (next) siblings would be returned. So, by selecting the title
row and calling next_siblings, we can select all the rows in the table, without select‐
ing the title row itself.
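The same behavior can be checked on a small inline table. The rows below are invented for this sketch and contain no whitespace between tags, so every sibling is a row:

```python
from bs4 import BeautifulSoup

# An invented table: one title row followed by two product rows
html = ('<table id="giftList">'
        '<tr><th>Item</th></tr>'
        '<tr><td>Vegetable Basket</td></tr>'
        '<tr><td>Russian Nesting Dolls</td></tr>'
        '</table>')
bsObj = BeautifulSoup(html, "html.parser")

titleRow = bsObj.find("table", {"id": "giftList"}).tr

# Every row after the title row; the title row itself is never included
products = [row.get_text() for row in titleRow.next_siblings]
print(products)

# Starting from the middle row, only the later row is returned
middleRow = titleRow.next_sibling
print([row.get_text() for row in middleRow.next_siblings])
```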

Make Selections Specific

Note that the preceding code will work just as well, if we select
bsObj.table.tr or even just bsObj.tr in order to select the first
row of the table. However, in the code, I go through all of the trou‐
ble of writing everything out in a longer form:
bsObj.find("table",{"id":"giftList"}).tr
Even if it looks like there’s just one table (or other target tag) on the
page, it’s easy to miss things. In addition, page layouts change all
the time. What was once the first of its kind on the page, might
someday be the second or third tag of that type found on the page.
To make your scrapers more robust, it’s best to be as specific as pos‐
sible when making tag selections. Take advantage of tag attributes
when they are available.

As a complement to next_siblings, previous_siblings can often be
helpful if there is an easily selectable tag at the end of a list of sibling tags that you
would like to get.
And, of course, there are next_sibling and previous_sibling, which
perform nearly the same function as next_siblings and previous_siblings, except
they return a single tag rather than a list of them.


Dealing with your parents
When scraping pages, you will likely discover that you need to find parents of tags
less frequently than you need to find their children or siblings. Typically, when we
look at HTML pages with the goal of crawling them, we start by looking at the top
layer of tags, and then figure out how to drill our way down into the exact piece of
data that we want. Occasionally, however, you can find yourself in odd situations that
require BeautifulSoup’s parent-finding functions, .parent and .parents. For exam‐
ple:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"
}).parent.previous_sibling.get_text())
This code will print out the price of the object represented by the image at the loca‐
tion ../img/gifts/img1.jpg (in this case, the price is “$15.00”).
How does this work? The following diagram represents the tree structure of the por‐
tion of the HTML page we are working with, with numbered steps:

• <tr>
    — <td>
    — <td>
    — <td> (3)
        — “$15.00” (4)
    — <td> (2)
        — <img src="../img/gifts/img1.jpg"> (1)

1. The image tag where src="../img/gifts/img1.jpg" is first selected

2. We select the parent of that tag (in this case, the <td> tag).
3. We select the previous_sibling of the <td> tag (in this case, the <td> tag that
contains the dollar value of the product).
4. We select the text within that tag, “$15.00”
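The four steps can be followed one at a time on an inline copy of a single row. The row below is a simplified assumption about the page's markup (the first two cells are placeholders), written without whitespace between cells so that previous_sibling lands directly on the neighboring <td>:

```python
from bs4 import BeautifulSoup

# A simplified, single-row stand-in for the gift table
html = ('<table><tr>'
        '<td>Vegetable Basket</td><td>A description</td>'
        '<td>$15.00</td>'
        '<td><img src="../img/gifts/img1.jpg"></td>'
        '</tr></table>')
bsObj = BeautifulSoup(html, "html.parser")

img = bsObj.find("img", {"src": "../img/gifts/img1.jpg"})  # step 1
imgCell = img.parent                                       # step 2
priceCell = imgCell.previous_sibling                       # step 3
price = priceCell.get_text()                               # step 4
print(price)
```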

Regular Expressions
As the old computer-science joke goes: “Let’s say you have a problem, and you decide
to solve it with regular expressions. Well, now you have two problems.”
Unfortunately, regular expressions (often shortened to regex) are often taught using
large tables of random symbols, strung together to look like a lot of nonsense. This


tends to drive people away, and later they get out into the workforce and write need‐
lessly complicated searching and filtering functions, when all they needed was a one-
line regular expression in the first place!
Fortunately for you, regular expressions are not all that difficult to get up and run‐
ning with quickly, and can easily be learned by looking at and experimenting with a
few simple examples.
Regular expressions are so called because they are used to identify regular strings;
that is, they can definitively say, “Yes, this string you’ve given me follows the rules,
and I’ll return it,” or “This string does not follow the rules, and I’ll discard it.” This
can be exceptionally handy for quickly scanning large documents to look for strings
that look like phone numbers or email addresses.
Notice that I used the phrase regular string. What is a regular string? It’s any string
that can be generated by a series of linear rules,3 such as:

1. Write the letter “a” at least once.

2. Append to this the letter “b” exactly five times.
3. Append to this the letter “c” any even number of times.
4. Optionally, write the letter “d” at the end.
Strings that follow these rules are: “aaaabbbbbccccd,” “aabbbbbcc,” and so on (there
are an infinite number of variations).
Regular expressions are merely a shorthand way of expressing these sets of rules. For
instance, here’s the regular expression for the series of steps just described:
aa*bbbbb(cc)*(d | )
This string might seem a little daunting at first, but it becomes clearer when we break
it down into its components:
aa*
The letter a is written, followed by a* (read as a star) which means “any number
of a’s, including 0 of them.” In this way, we can guarantee that the letter a is
written at least once.
bbbbb
No special effects here—just five b’s in a row.

3 You might be asking yourself, “Are there ‘irregular’ expressions?” Nonregular expressions are beyond the
scope of this book, but they encompass strings such as “write a prime number of a’s, followed by exactly twice
that number of b’s” or “write a palindrome.” It’s impossible to identify strings of this type with a regular
expression. Fortunately, I’ve never been in a situation where my web scraper needed to identify these kinds of
strings.

(cc)*
Any even number of things can be grouped into pairs, so in order to enforce this
rule about even things, you can write two c’s, surround them in parentheses, and
write an asterisk after it, meaning that you can have any number of pairs of c’s
(note that this can mean 0 pairs, as well).
(d | )
Adding a bar in the middle of two expressions means that it can be “this thing or
that thing.” In this case, we are saying “add a d followed by a space or just add a
space without a d.” In this way we can guarantee that there is, at most, one d,
followed by a space, completing the string.
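
To experiment with these rules in Python, here is a small sketch using the standard re module. It makes one assumption: the final group is written as (d|), without the spaces around the bar, because a space inside a Python pattern is matched as a literal character. The helper name follows_rules is illustrative only.

```python
import re

# Variant of the pattern from the text with the spaces around the bar removed
# (an assumption for this sketch): "(d|)" means "a d, or nothing".
PATTERN = re.compile(r"aa*bbbbb(cc)*(d|)")


def follows_rules(candidate):
    """Return True only if the whole string obeys the four rules."""
    return PATTERN.fullmatch(candidate) is not None


for candidate in ["aaaabbbbbccccd", "aabbbbbcc", "abbbbbccc", "bbbbbcc"]:
    print(candidate, follows_rules(candidate))
```

The sketch uses fullmatch so that the entire string must follow the rules; re.match would accept a string whose beginning matches and ignore whatever comes after.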

Experimenting with RegEx

When learning how to write regular expressions, it’s critical to play
around with them and get a feel for how they work.
If you don’t feel like firing up a code editor, writing a few lines, and
running your program in order to see if a regular expression works
as expected, you can go to a website such as RegexPal and test your
regular expressions on the fly.

One classic example of regular expressions can be found in the practice of identifying
email addresses. Although the exact rules governing email addresses vary slightly
from mail server to mail server, we can create a few general rules. Each rule is shown
below with its corresponding regular expression and a short explanation:

Rule 1: The first part of an email address contains at least one of the following:
uppercase letters, lowercase letters, the numbers 0-9, periods (.), plus signs (+), or
underscores (_).
Regular expression: [A-Za-z0-9\._+]+
The regular expression shorthand is pretty smart. For example, it knows that “A-Z”
means “any uppercase letter, A through Z.” By putting all these possible sequences
and symbols in brackets (as opposed to parentheses) we are saying “this symbol can be
any one of these things we’ve listed in the brackets.” Note also that the + sign means
“these characters can occur as many times as they want to, but must occur at least
once.”

Rule 2: After this, the email address contains the @ symbol.
Regular expression: @
This is fairly straightforward: the @ symbol must occur in the middle, and it must
occur exactly once.

Rule 3: The email address then must contain at least one uppercase or lowercase
letter.
Regular expression: [A-Za-z]+
We may use only letters in the first part of the domain name, after the @ symbol.
Also, there must be at least one character.

Rule 4: This is followed by a period (.).
Regular expression: \.
You must include a period (.) before the domain name.

Rule 5: Finally, the email address ends with com, org, edu, or net (in reality, there
are many possible top-level domains, but, these four should suffice for the sake of
example).
Regular expression: (com|org|edu|net)
This lists the possible sequences of letters that can occur after the period in the
second part of an email address.

By concatenating all of the rules, we arrive at the regular expression:

[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)
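
As a quick check of the concatenated expression, the sketch below runs it against a few invented addresses (the sample strings and the helper name looks_like_email are assumptions made for illustration). It uses re.fullmatch so that the whole string has to satisfy the five rules.

```python
import re

# The concatenated pattern from the text, written as a raw string
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)")


def looks_like_email(candidate):
    """Return True if the entire string satisfies the five rules."""
    return EMAIL_PATTERN.fullmatch(candidate) is not None


# Invented sample addresses, used only for illustration
samples = [
    "jane.doe+lists@example.com",  # satisfies every rule
    "jane@example.io",             # "io" is not one of the four endings
    "jane@mail.example.com",       # rule 3 allows only letters before the period
    "@example.com",                # rule 1 needs at least one character
]

for sample in samples:
    print(sample, looks_like_email(sample))
```

The third sample shows how narrow these simplified rules are: a perfectly ordinary address with a subdomain is rejected, which is the kind of edge case worth listing before writing the expression.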
When attempting to write any regular expression from scratch, it’s best to first make a
list of steps that concretely outlines what your target string looks like. Pay attention to
edge cases. For instance, if you’re identifying phone numbers, are you considering
country codes and extensions?
Table 2-1 lists some commonly used regular expression symbols, with a brief
explanation and example. This list is by no means complete, and as mentioned before, you
might encounter slight variations from language to language. However, these 12
symbols are the most commonly used regular expressions in Python, and can be used to
find and collect most any string type.

Table 2-1. Commonly used regular expression symbols

Symbol(s): *
Meaning: Matches the preceding character, subexpression, or bracketed character,
0 or more times
Example: a*b*
Example matches: aaaaaaaa, aaabbbbb, bbbbbb

Symbol(s): +
Meaning: Matches the preceding character, subexpression, or bracketed character,
1 or more times
Example: a+b+
Example matches: aaaaaaaab, aaabbbbb, abbbbbb

Symbol(s): []
Meaning: Matches any character within the brackets (i.e., “Pick any one of these
things”)
Example: [A-Z]*
Example matches: APPLE, CAPITALS, QWERTY

Symbol(s): ()
Meaning: A grouped subexpression (these are evaluated first, in the “order of
operations” of regular expressions)
Example: (a*b)*
Example matches: aaabaab, abaaab, ababaaaaab

Symbol(s): {m,n}
Meaning: Matches the preceding character, subexpression, or bracketed character
between m and n times (inclusive)
Example: a{2,3}b{2,3}
Example matches: aabbb, aaabbb, aabb

Symbol(s): [^]
Meaning: Matches any single character that is not in the brackets
Example: [^A-Z]*
Example matches: apple, lowercase, qwerty

Symbol(s): |
Meaning: Matches any character, string of characters, or subexpression, separated
by the “|” (note that this is a vertical bar, or “pipe,” not a capital “i”)
Example: b(a|i|e)d
Example matches: bad, bid, bed

Symbol(s): .
Meaning: Matches any single character (including symbols, numbers, a space, etc.)
Example: b.d
Example matches: bad, bzd, b$d, b d

Symbol(s): ^
Meaning: Indicates that a character or subexpression occurs at the beginning of a
string
Example: ^a
Example matches: apple, asdf, a

Symbol(s): \
Meaning: An escape character (this allows you to use “special” characters as their
literal meaning)
Example: \. \| \\
Example matches: . | \

Symbol(s): $
Meaning: Often used at the end of a regular expression, it means “match this up
to the end of the string.” Without it, every regular expression has a
de facto “.*” at the end of it, accepting strings where only the first part
of the string matches. This can be thought of as analogous to the ^
symbol.
Example: [A-Z]*[a-z]*$
Example matches: ABCabc, zzzyx, Bob

Symbol(s): ?!
Meaning: “Does not contain.” This odd pairing of symbols, immediately preceding
a character (or regular expression), indicates that that character should
not be found in that specific place in the larger string. This can be tricky
to use; after all, the character might be found in a different part of the
string. If trying to eliminate a character entirely, use in conjunction with
a ^ and $ at either end.
Example: ^((?![A-Z]).)*$
Example matches: no-caps-here, $ymb0ls a4e f!ne
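
A few of these symbols can be checked directly in Python. The sketch below pairs patterns from the table with test strings; the non-matching strings ("ab" and "Some Caps") are invented here to show a failing case. It uses re.fullmatch, so each pattern has to account for the whole string.

```python
import re

# (pattern, test string, expected result); patterns come from Table 2-1,
# the failing strings are made up for contrast
checks = [
    (r"a{2,3}b{2,3}", "aabbb", True),
    (r"a{2,3}b{2,3}", "ab", False),
    (r"b(a|i|e)d", "bid", True),
    (r"b.d", "b$d", True),
    (r"[^A-Z]*", "lowercase", True),
    (r"^((?![A-Z]).)*$", "no-caps-here", True),
    (r"^((?![A-Z]).)*$", "Some Caps", False),
]

results = []
for pattern, text, expected in checks:
    matched = re.fullmatch(pattern, text) is not None
    results.append(matched)
    print(pattern, repr(text), matched, "(expected %s)" % expected)
```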

Regular Expressions: Not Always Regular!

The standard version of regular expressions (the one we are covering
in this book, and that is used by Python and BeautifulSoup) is
based on syntax used by Perl. Most modern programming languages
use this or one very similar to it. Be aware, however, that if
you are using regular expressions in another language, you might
encounter problems. Even some modern languages, such as Java,
have slight differences in the way they handle regular expressions.
When in doubt, read the docs!

Regular Expressions and BeautifulSoup
If the previous section on regular expressions seemed a little disjointed from the
mission of this book, here’s where it all ties together. BeautifulSoup and regular
expressions go hand in hand when it comes to scraping the Web. In fact, most functions
that take in a string argument (e.g., find(id="aTagIdHere")) will also take in a
regular expression just as well.
Let’s take a look at some examples, scraping the page found at
http://www.pythonscraping.com/pages/page3.html.
Notice that there are many product images on the site—they take the following form:
<img src="../img/gifts/img3.jpg">
If we wanted to grab URLs to all of the product images, it might seem fairly
straightforward at first: just grab all the image tags using .findAll("img"), right? But there’s
a problem. In addition to the obvious “extra” images (e.g., logos), modern websites
often have hidden images, blank images used for spacing and aligning elements, and
other random image tags you might not be aware of. Certainly, you can’t count on the
only images on the page being product images.
Let’s also assume that the layout of the page might change, or that, for whatever
reason, we don’t want to depend on the position of the image in the page in order to find
the correct tag. This might be the case when you are trying to grab specific elements
or pieces of data that are scattered randomly throughout a website. For instance,
there might be a featured product image in a special layout at the top of some pages,
but not others.
The solution is to look for something identifying about the tag itself. In this case, we
can look at the file path of the product images:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
# Keep only <img> tags whose src matches the product-image path pattern
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
This prints out only the relative image paths that start with ../img/gifts/img and end
in .jpg, the output of which is the following:
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg

A regular expression can be inserted as any argument in a BeautifulSoup expression,
allowing you a great deal of flexibility in finding target elements.
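
The same filter can be tried without a network connection. This is a sketch against a made-up HTML fragment (SAMPLE_HTML and its logo and spacer images are assumptions, not the real page's contents). It calls find_all, the current BeautifulSoup spelling of findAll.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment: two product images mixed with a logo and a spacer image
SAMPLE_HTML = """
<div>
  <img src="../img/logo.jpg">
  <img src="../img/gifts/img1.jpg">
  <img src="../img/spacer.gif">
  <img src="../img/gifts/img2.jpg">
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A compiled pattern can stand in wherever a string attribute value is accepted
product_path = re.compile(r"\.\./img/gifts/img.*\.jpg")
product_sources = [
    img["src"] for img in soup.find_all("img", {"src": product_path})
]

print(product_sources)
```

Only the two paths under ../img/gifts/ survive the filter, regardless of where the tags sit in the document.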

Accessing Attributes
So far, we’ve looked at how to access and filter tags and access content within them.
However, very often in web scraping you’re not looking for the content of a tag; you’re
looking for its attributes. This becomes especially useful for tags such as <a>, where
the URL it is pointing to is contained within the href attribute, or the <img> tag,
where the target image is contained within the src attribute.
With tag objects, all of a tag’s attributes can be accessed at once by calling:
myTag.attrs
Keep in mind that this literally returns a Python dictionary object, which makes
retrieval and manipulation of these attributes trivial. The source location for an
image, for example, can be found using the following line:
myImgTag.attrs['src']
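
Because attrs behaves like a dictionary, the usual dictionary operations apply. A short sketch, assuming a made-up <a> tag (the href and id values are illustrative):

```python
from bs4 import BeautifulSoup

# Made-up link tag used only for illustration
soup = BeautifulSoup(
    '<a href="/pages/page3.html" id="gift-link">Gifts</a>', "html.parser"
)
link = soup.find("a")

attributes = link.attrs               # every attribute of the tag
href = link.attrs["href"]             # look up one attribute by name
title = link.attrs.get("title")       # .get() avoids a KeyError when absent

print(dict(attributes))
print(href)
print(title)
```

Using .get() is a reasonable default when scraping, since real pages often omit attributes you expect to be present.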

Lambda Expressions
If you have a formal education in computer science, you probably learned about
lambda expressions once in school and then never used them again. If you don’t, they
might be unfamiliar to you (or familiar only as “that thing I’ve been meaning to learn
at some point”). In this section, we won’t go deeply into these extremely useful
functions, but we will look at a few examples of how they can be useful in web scraping.
Essentially, a lambda expression is a function that is passed into another function as a
variable; that is, instead of defining a function as f(x, y), you may define a function as
f(g(x), y), or even f(g(x), h(x)).
BeautifulSoup allows us to pass certain types of functions as parameters into the
findAll function. The only restriction is that these functions must take a tag object as an
argument and return a boolean. Every tag object that BeautifulSoup encounters is
evaluated in this function, and tags that evaluate to “true” are returned while the rest
are discarded.
For example, the following retrieves all tags that have exactly two attributes:
bsObj.findAll(lambda tag: len(tag.attrs) == 2)
That is, it will find tags such as the following:
<div class="body" id="content"></div>
<span style="color:red" class="title"></span>

Lambda functions used as BeautifulSoup selectors can be a great substitute for
writing a regular expression, if you’re comfortable with writing a little code.
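
To see the two-attribute filter in action, here is a sketch run against a made-up fragment (SAMPLE and its tags are assumptions for illustration). It uses find_all, the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Made-up fragment: two tags with exactly two attributes, two tags with fewer
SAMPLE = (
    '<div class="body" id="content">'
    '<span style="color:red" class="title">Title</span>'
    '<p id="intro">One attribute</p>'
    '<p>No attributes</p>'
    '</div>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")

# The lambda is called once per tag; tags for which it returns True are kept
two_attr_tags = soup.find_all(lambda tag: len(tag.attrs) == 2)
names = [tag.name for tag in two_attr_tags]

print(names)
```

Any callable that takes a tag and returns a boolean works here, so a named function defined with def is a fine alternative once the test grows beyond one line.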

Beyond BeautifulSoup
Although BeautifulSoup is used throughout this book (and is one of the most
popular HTML libraries available for Python), keep in mind that it’s not the only option. If
BeautifulSoup does not meet your needs, check out these other widely used libraries:
lxml
This library is used for parsing both HTML and XML documents, and is known
for being very low level and heavily based on C. Although it takes a while to learn
(a steep learning curve actually means you learn it very fast), it is very fast at
parsing most HTML documents.
HTML Parser
This is Python’s built-in parsing library. Because it requires no installation (other
than, obviously, having Python installed in the first place), it can be extremely
convenient to use.
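
As a rough illustration of the built-in option, the sketch below uses the standard library's html.parser module to collect image sources. The class name ImageSourceCollector and the sample markup are made up for this example. Unlike BeautifulSoup, the module does not build a tree, so you respond to tags as they are encountered.

```python
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.sources.append(value)


collector = ImageSourceCollector()
collector.feed(
    '<div><img src="../img/gifts/img1.jpg">'
    '<img src="../img/gifts/img2.jpg" alt="gift"></div>'
)
collector.close()

print(collector.sources)
```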

Beyond BeautifulSoup | 29
Random documents with unrelated
content Scribd suggests to you:
*** END OF THE PROJECT GUTENBERG EBOOK GRAHAM'S
MAGAZINE, VOL. XXXV, NO. 2, AUGUST 1849 ***

Updated editions will replace the previous one—the old editions

will be renamed.

Creating the works from print editions not protected by U.S.

copyright law means that no one owns a United States
copyright in these works, so the Foundation (and you!) can copy
and distribute it in the United States without permission and
without paying copyright royalties. Special rules, set forth in the
General Terms of Use part of this license, apply to copying and
distributing Project Gutenberg™ electronic works to protect the
PROJECT GUTENBERG™ concept and trademark. Project
Gutenberg is a registered trademark, and may not be used if
you charge for an eBook, except by following the terms of the
trademark license, including paying royalties for use of the
Project Gutenberg trademark. If you do not charge anything for
copies of this eBook, complying with the trademark license is
very easy. You may use this eBook for nearly any purpose such
as creation of derivative works, reports, performances and
research. Project Gutenberg eBooks may be modified and
printed and given away—you may do practically ANYTHING in
the United States with eBooks not protected by U.S. copyright
law. Redistribution is subject to the trademark license, especially
commercial redistribution.

START: FULL LICENSE

THE FULL PROJECT GUTENBERG LICENSE
PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK

To protect the Project Gutenberg™ mission of promoting the

free distribution of electronic works, by using or distributing this
work (or any other work associated in any way with the phrase
“Project Gutenberg”), you agree to comply with all the terms of
the Full Project Gutenberg™ License available with this file or
online at www.gutenberg.org/license.

Section 1. General Terms of Use and

Redistributing Project Gutenberg™
electronic works
1.A. By reading or using any part of this Project Gutenberg™
electronic work, you indicate that you have read, understand,
agree to and accept all the terms of this license and intellectual
property (trademark/copyright) agreement. If you do not agree
to abide by all the terms of this agreement, you must cease
using and return or destroy all copies of Project Gutenberg™
electronic works in your possession. If you paid a fee for
obtaining a copy of or access to a Project Gutenberg™
electronic work and you do not agree to be bound by the terms
of this agreement, you may obtain a refund from the person or
entity to whom you paid the fee as set forth in paragraph 1.E.8.

1.B. “Project Gutenberg” is a registered trademark. It may only

be used on or associated in any way with an electronic work by
people who agree to be bound by the terms of this agreement.
There are a few things that you can do with most Project
Gutenberg™ electronic works even without complying with the
full terms of this agreement. See paragraph 1.C below. There
are a lot of things you can do with Project Gutenberg™
electronic works if you follow the terms of this agreement and
help preserve free future access to Project Gutenberg™
electronic works. See paragraph 1.E below.
1.C. The Project Gutenberg Literary Archive Foundation (“the
Foundation” or PGLAF), owns a compilation copyright in the
collection of Project Gutenberg™ electronic works. Nearly all the
individual works in the collection are in the public domain in the
United States. If an individual work is unprotected by copyright
law in the United States and you are located in the United
States, we do not claim a right to prevent you from copying,
distributing, performing, displaying or creating derivative works
based on the work as long as all references to Project
Gutenberg are removed. Of course, we hope that you will
support the Project Gutenberg™ mission of promoting free
access to electronic works by freely sharing Project Gutenberg™
works in compliance with the terms of this agreement for
keeping the Project Gutenberg™ name associated with the
work. You can easily comply with the terms of this agreement
by keeping this work in the same format with its attached full
Project Gutenberg™ License when you share it without charge
with others.

1.D. The copyright laws of the place where you are located also
govern what you can do with this work. Copyright laws in most
countries are in a constant state of change. If you are outside
the United States, check the laws of your country in addition to
the terms of this agreement before downloading, copying,
displaying, performing, distributing or creating derivative works
based on this work or any other Project Gutenberg™ work. The
Foundation makes no representations concerning the copyright
status of any work in any country other than the United States.

1.E. Unless you have removed all references to Project

Gutenberg:

1.E.1. The following sentence, with active links to, or other

immediate access to, the full Project Gutenberg™ License must
appear prominently whenever any copy of a Project
Gutenberg™ work (any work on which the phrase “Project
Gutenberg” appears, or with which the phrase “Project
Gutenberg” is associated) is accessed, displayed, performed,
viewed, copied or distributed:

This eBook is for the use of anyone anywhere in the United

States and most other parts of the world at no cost and
with almost no restrictions whatsoever. You may copy it,
give it away or re-use it under the terms of the Project
Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United
States, you will have to check the laws of the country
where you are located before using this eBook.

1.E.2. If an individual Project Gutenberg™ electronic work is

derived from texts not protected by U.S. copyright law (does not
contain a notice indicating that it is posted with permission of
the copyright holder), the work can be copied and distributed to
anyone in the United States without paying any fees or charges.
If you are redistributing or providing access to a work with the
phrase “Project Gutenberg” associated with or appearing on the
work, you must comply either with the requirements of
paragraphs 1.E.1 through 1.E.7 or obtain permission for the use
of the work and the Project Gutenberg™ trademark as set forth
in paragraphs 1.E.8 or 1.E.9.

1.E.3. If an individual Project Gutenberg™ electronic work is

posted with the permission of the copyright holder, your use and
distribution must comply with both paragraphs 1.E.1 through
1.E.7 and any additional terms imposed by the copyright holder.
Additional terms will be linked to the Project Gutenberg™
License for all works posted with the permission of the copyright
holder found at the beginning of this work.

1.E.4. Do not unlink or detach or remove the full Project

Gutenberg™ License terms from this work, or any files
containing a part of this work or any other work associated with
Project Gutenberg™.

1.E.5. Do not copy, display, perform, distribute or redistribute

this electronic work, or any part of this electronic work, without
prominently displaying the sentence set forth in paragraph 1.E.1
with active links or immediate access to the full terms of the
Project Gutenberg™ License.

1.E.6. You may convert to and distribute this work in any binary,
compressed, marked up, nonproprietary or proprietary form,
including any word processing or hypertext form. However, if
you provide access to or distribute copies of a Project
Gutenberg™ work in a format other than “Plain Vanilla ASCII” or
other format used in the official version posted on the official
Project Gutenberg™ website (www.gutenberg.org), you must,
at no additional cost, fee or expense to the user, provide a copy,
a means of exporting a copy, or a means of obtaining a copy
upon request, of the work in its original “Plain Vanilla ASCII” or
other form. Any alternate format must include the full Project
Gutenberg™ License as specified in paragraph 1.E.1.

1.E.7. Do not charge a fee for access to, viewing, displaying,

performing, copying or distributing any Project Gutenberg™
works unless you comply with paragraph 1.E.8 or 1.E.9.

1.E.8. You may charge a reasonable fee for copies of or

providing access to or distributing Project Gutenberg™
electronic works provided that:

• You pay a royalty fee of 20% of the gross profits you derive
from the use of Project Gutenberg™ works calculated using the
method you already use to calculate your applicable taxes. The
fee is owed to the owner of the Project Gutenberg™ trademark,
but he has agreed to donate royalties under this paragraph to
the Project Gutenberg Literary Archive Foundation. Royalty
payments must be paid within 60 days following each date on
which you prepare (or are legally required to prepare) your
periodic tax returns. Royalty payments should be clearly marked
as such and sent to the Project Gutenberg Literary Archive
Foundation at the address specified in Section 4, “Information
about donations to the Project Gutenberg Literary Archive
Foundation.”

• You provide a full refund of any money paid by a user who

notifies you in writing (or by e-mail) within 30 days of receipt
that s/he does not agree to the terms of the full Project
Gutenberg™ License. You must require such a user to return or
destroy all copies of the works possessed in a physical medium
and discontinue all use of and all access to other copies of
Project Gutenberg™ works.

• You provide, in accordance with paragraph 1.F.3, a full refund of

any money paid for a work or a replacement copy, if a defect in
the electronic work is discovered and reported to you within 90
days of receipt of the work.

• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.

1.E.9. If you wish to charge a fee or distribute a Project

Gutenberg™ electronic work or group of works on different
terms than are set forth in this agreement, you must obtain
permission in writing from the Project Gutenberg Literary
Archive Foundation, the manager of the Project Gutenberg™
trademark. Contact the Foundation as set forth in Section 3
below.

1.F.

1.F.1. Project Gutenberg volunteers and employees expend

considerable effort to identify, do copyright research on,
transcribe and proofread works not protected by U.S. copyright
law in creating the Project Gutenberg™ collection. Despite these
efforts, Project Gutenberg™ electronic works, and the medium
on which they may be stored, may contain “Defects,” such as,
but not limited to, incomplete, inaccurate or corrupt data,
transcription errors, a copyright or other intellectual property
infringement, a defective or damaged disk or other medium, a
computer virus, or computer codes that damage or cannot be
read by your equipment.

1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except

for the “Right of Replacement or Refund” described in
paragraph 1.F.3, the Project Gutenberg Literary Archive
Foundation, the owner of the Project Gutenberg™ trademark,
and any other party distributing a Project Gutenberg™ electronic
work under this agreement, disclaim all liability to you for
damages, costs and expenses, including legal fees. YOU AGREE
THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE, STRICT
LIABILITY, BREACH OF WARRANTY OR BREACH OF CONTRACT
EXCEPT THOSE PROVIDED IN PARAGRAPH 1.F.3. YOU AGREE
THAT THE FOUNDATION, THE TRADEMARK OWNER, AND ANY
DISTRIBUTOR UNDER THIS AGREEMENT WILL NOT BE LIABLE
TO YOU FOR ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL,
PUNITIVE OR INCIDENTAL DAMAGES EVEN IF YOU GIVE
NOTICE OF THE POSSIBILITY OF SUCH DAMAGE.

1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If you

discover a defect in this electronic work within 90 days of
receiving it, you can receive a refund of the money (if any) you
paid for it by sending a written explanation to the person you
received the work from. If you received the work on a physical
medium, you must return the medium with your written
explanation. The person or entity that provided you with the
defective work may elect to provide a replacement copy in lieu
of a refund. If you received the work electronically, the person
or entity providing it to you may choose to give you a second
opportunity to receive the work electronically in lieu of a refund.
If the second copy is also defective, you may demand a refund
in writing without further opportunities to fix the problem.

1.F.4. Except for the limited right of replacement or refund set

forth in paragraph 1.F.3, this work is provided to you ‘AS-IS’,
WITH NO OTHER WARRANTIES OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PURPOSE.

1.F.5. Some states do not allow disclaimers of certain implied

warranties or the exclusion or limitation of certain types of
damages. If any disclaimer or limitation set forth in this
agreement violates the law of the state applicable to this
agreement, the agreement shall be interpreted to make the
maximum disclaimer or limitation permitted by the applicable
state law. The invalidity or unenforceability of any provision of
this agreement shall not void the remaining provisions.

1.F.6. INDEMNITY - You agree to indemnify and hold the

Foundation, the trademark owner, any agent or employee of the
Foundation, anyone providing copies of Project Gutenberg™
electronic works in accordance with this agreement, and any
volunteers associated with the production, promotion and
distribution of Project Gutenberg™ electronic works, harmless
from all liability, costs and expenses, including legal fees, that
arise directly or indirectly from any of the following which you
do or cause to occur: (a) distribution of this or any Project
Gutenberg™ work, (b) alteration, modification, or additions or
deletions to any Project Gutenberg™ work, and (c) any Defect
you cause.

Section 2. Information about the Mission

of Project Gutenberg™
Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new
computers. It exists because of the efforts of hundreds of
volunteers and donations from people in all walks of life.

Volunteers and financial support to provide volunteers with the

assistance they need are critical to reaching Project
Gutenberg™’s goals and ensuring that the Project Gutenberg™
collection will remain freely available for generations to come. In
2001, the Project Gutenberg Literary Archive Foundation was
created to provide a secure and permanent future for Project
Gutenberg™ and future generations. To learn more about the
Project Gutenberg Literary Archive Foundation and how your
efforts and donations can help, see Sections 3 and 4 and the
Foundation information page at www.gutenberg.org.

Section 3. Information about the Project

Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-
profit 501(c)(3) educational corporation organized under the
laws of the state of Mississippi and granted tax exempt status
by the Internal Revenue Service. The Foundation’s EIN or
federal tax identification number is 64-6221541. Contributions
to the Project Gutenberg Literary Archive Foundation are tax
deductible to the full extent permitted by U.S. federal laws and
your state’s laws.

The Foundation’s business office is located at 809 North 1500

West, Salt Lake City, UT 84116, (801) 596-1887. Email contact
links and up to date contact information can be found at the
Foundation’s website and official page at
www.gutenberg.org/contact
Section 4. Information about Donations to
the Project Gutenberg Literary Archive
Foundation
Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission
of increasing the number of public domain and licensed works
that can be freely distributed in machine-readable form
accessible by the widest array of equipment including outdated
equipment. Many small donations ($1 to $5,000) are particularly
important to maintaining tax exempt status with the IRS.

The Foundation is committed to complying with the laws

regulating charities and charitable donations in all 50 states of
the United States. Compliance requirements are not uniform
and it takes a considerable effort, much paperwork and many
fees to meet and keep up with these requirements. We do not
solicit donations in locations where we have not received written
confirmation of compliance. To SEND DONATIONS or determine
the status of compliance for any particular state visit
www.gutenberg.org/donate.

While we cannot and do not solicit contributions from states

where we have not met the solicitation requirements, we know
of no prohibition against accepting unsolicited donations from
donors in such states who approach us with offers to donate.

International donations are gratefully accepted, but we cannot

make any statements concerning tax treatment of donations
received from outside the United States. U.S. laws alone swamp
our small staff.

Please check the Project Gutenberg web pages for current

donation methods and addresses. Donations are accepted in a
number of other ways including checks, online payments and
credit card donations. To donate, please visit:
www.gutenberg.org/donate.

Section 5. General Information About

Project Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could
be freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose
network of volunteer support.

Project Gutenberg™ eBooks are often created from several

printed editions, all of which are confirmed as not protected by
copyright in the U.S. unless a copyright notice is included. Thus,
we do not necessarily keep eBooks in compliance with any
particular paper edition.

Most people start at our website, which has the main PG search
facility: www.gutenberg.org.

This website includes information about Project Gutenberg™,
including how to make donations to the Project Gutenberg
Literary Archive Foundation, how to help produce our new
eBooks, and how to subscribe to our email newsletter to hear
about new eBooks.
