Author(s): Ryan Mitchell
ISBN(s): 9781491910290, 1491910291
Edition: 1
Year: 2015
Language: english
Web Scraping with Python
Collecting Data from the Modern Web
Ryan Mitchell
Boston
Web Scraping with Python
by Ryan Mitchell
Copyright © 2015 Ryan Mitchell. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or [email protected].
Editors: Simon St. Laurent and Allyson MacDonald
Production Editor: Shiny Kalapurakkel
Copyeditor: Jasmine Kwityn
Proofreader: Carla Thornton
Indexer: Lucie Haskins
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Web Scraping with Python, the cover
image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-491-91027-6
[LSI]
Table of Contents
Preface
3. Starting to Crawl
Traversing a Single Domain
Crawling an Entire Site
Collecting Data Across an Entire Site
Crawling Across the Internet
Crawling with Scrapy
4. Using APIs
How APIs Work
Common Conventions
Methods
Authentication
Responses
API Calls
Echo Nest
A Few Examples
Twitter
Getting Started
A Few Examples
Google APIs
Getting Started
A Few Examples
Parsing JSON
Bringing It All Back Home
More About APIs
5. Storing Data
Media Files
Storing Data to CSV
MySQL
Installing MySQL
Some Basic Commands
Integrating with Python
Database Techniques and Good Practice
“Six Degrees” in MySQL
Email
6. Reading Documents
Document Encoding
Text
Text Encoding and the Global Internet
CSV
Reading CSV Files
PDF
Microsoft Word and .docx
Data Normalization
Cleaning After the Fact
OpenRefine
12. Avoiding Scraping Traps
A Note on Ethics
Looking Like a Human
Adjust Your Headers
Handling Cookies
Timing Is Everything
Common Form Security Features
Hidden Input Field Values
Avoiding Honeypots
The Human Checklist
Index
Preface
To those who have not developed the skill, computer programming can seem like a
kind of magic. If programming is magic, then web scraping is wizardry; that is, the
application of magic for particularly impressive and useful—yet surprisingly effortless
—feats.
In fact, in my years as a software engineer, I’ve found that very few programming
practices capture the excitement of both programmers and laymen alike quite like
web scraping. The ability to write a simple bot that collects data and streams it down
a terminal or stores it in a database, while not difficult, never fails to provide a certain
thrill and sense of possibility, no matter how many times you might have done it
before.
It’s unfortunate that when I speak to other programmers about web scraping, there’s a
lot of misunderstanding and confusion about the practice. Some people aren’t sure if
it’s legal (it is), or how to handle the modern Web, with all its JavaScript, multimedia,
and cookies. Some get confused about the distinction between APIs and web scrapers.
This book seeks to put an end to many of these common questions and misconceptions
about web scraping, while providing a comprehensive guide to most common
web-scraping tasks.
Beginning in Chapter 1, I’ll provide code samples periodically to demonstrate concepts.
These code samples are in the public domain, and can be used with or without
attribution (although acknowledgment is always appreciated). All code samples also
will be available on the website for viewing and downloading.
What Is Web Scraping?
The automated gathering of data from the Internet is nearly as old as the Internet
itself. Although web scraping is not a new term, in years past the practice has been
more commonly known as screen scraping, data mining, web harvesting, or similar
variations. General consensus today seems to favor web scraping, so that is the term
I’ll use throughout the book, although I will occasionally refer to the web-scraping
programs themselves as bots.
In theory, web scraping is the practice of gathering data through any means other
than a program interacting with an API (or, obviously, through a human using a web
browser). This is most commonly accomplished by writing an automated program
that queries a web server, requests data (usually in the form of the HTML and other
files that comprise web pages), and then parses that data to extract needed information.
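As a rough illustration of the "parse" step described above, the sketch below pulls the page title out of a block of HTML. It uses Python's standard-library html.parser purely as a stand-in; it is not necessarily the approach this book takes later. The class name TitleParser, the function extract_title, and the SAMPLE_HTML string are all hypothetical names invented for this example, and the HTML is hard-coded rather than fetched from a server.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting for the first <title> we see
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html_text):
    """Return the page title, or None if the document has no <title>."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip() if parser.title is not None else None


# Hypothetical HTML standing in for a web server's response
SAMPLE_HTML = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><p>Some content</p></body></html>"
)

if __name__ == "__main__":
    print(extract_title(SAMPLE_HTML))
```

In a real scraper, the string passed to extract_title would come from the server's response rather than from a constant.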
In practice, web scraping encompasses a wide variety of programming techniques
and technologies, such as data analysis and information security. This book will cover
the basics of web scraping and crawling (Part I), and delve into some of the advanced
topics in Part II.
want to use such as Twitter posts or Wikipedia pages. In general, it is preferable to use
an API (if one exists), rather than build a bot to get the same data. However, there are
several reasons why an API might not exist:
• You are gathering data across a collection of sites that do not have a cohesive API.
• The data you want is a fairly small, finite set that the webmaster did not think
warranted an API.
• The source does not have the infrastructure or technical ability to create an API.
Even when an API does exist, request volume and rate limits, the types of data, or the
format of data that it provides might be insufficient for your purposes.
This is where web scraping steps in. With few exceptions, if you can view it in your
browser, you can access it via a Python script. If you can access it in a script, you can
store it in a database. And if you can store it in a database, you can do virtually anything
with that data.
There are obviously many extremely practical applications of having access to nearly
unlimited data: market forecasting, machine language translation, and even medical
diagnostics have benefited tremendously from the ability to retrieve and analyze data
from news sites, translated texts, and health forums, respectively.
Even in the art world, web scraping has opened up new frontiers for creation. The
2006 project “We Feel Fine” by Jonathan Harris and Sep Kamvar scraped a variety of
English-language blog sites for phrases starting with “I feel” or “I am feeling.” This led
to a popular data visualization describing how the world was feeling day by day and
minute by minute.
Regardless of your field, there is almost always a way web scraping can guide business
practices more effectively, improve productivity, or even branch off into a brand-new
field entirely.
If you’re looking for a more comprehensive Python resource, the book Introducing
Python by Bill Lubanovic is a very good, if lengthy, guide. For those with shorter
attention spans, the video series Introduction to Python by Jessica McKellar is an
excellent resource.
Appendix C includes case studies, as well as a breakdown of key issues that might
affect how you can legally run scrapers in the United States and use the data that they
produce.
Technical books are often able to focus on a single language or technology, but web
scraping is a relatively disparate subject, with practices that require the use of databases,
web servers, HTTP, HTML, Internet security, image processing, data science, and
other tools. This book attempts to cover all of these to an extent for the purpose of
gathering data from remote sources across the Internet.
Part I covers the subject of web scraping and web crawling in depth, with a strong
focus on a small handful of libraries used throughout the book. Part I can easily be
used as a comprehensive reference for these libraries and techniques (with certain
exceptions, where additional references will be provided).
Part II covers additional subjects that the reader might find useful when writing web
scrapers. These subjects are, unfortunately, too broad to be neatly wrapped up in a
single chapter. Because of this, frequent references will be made to other resources
for additional information.
This book is structured so that you can easily jump among chapters to find only the
web-scraping technique or information you are looking for. When a concept or piece
of code builds on another mentioned in a previous chapter, I will explicitly reference
the section in which it was addressed.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined
by context.
Safari® Books Online
Safari Books Online is an on-demand digital library that delivers
expert content in both book and video form from the
world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative
professionals use Safari Books Online as their primary resource for research,
problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organizations,
government agencies, and individuals. Subscribers have access to thousands
of books, training videos, and prepublication manuscripts in one fully searchable
database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-
Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco
Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt,
Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett,
Course Technology, and dozens more. For more information about Safari Books
Online, please visit us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://oreil.ly/1ePG2Uj.
To comment or ask technical questions about this book, send email to bookques‐
[email protected].
For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
Just like some of the best products arise out of a sea of user feedback, this book could
have never existed in any useful form without the help of many collaborators, cheerleaders,
and editors. Thank you to the O’Reilly staff and their amazing support for
this somewhat unconventional subject, to my friends and family who have offered
advice and put up with impromptu readings, and to my coworkers at LinkeDrive who
I now likely owe many hours of work to.
Thank you, in particular, to Allyson MacDonald, Brian Anderson, Miguel Grinberg,
and Eric VanWyk for their feedback, guidance, and occasional tough love. Quite a few
sections and code samples were written as a direct result of their inspirational suggestions.
Thank you to Yale Specht for his limitless patience throughout the past nine months,
providing the initial encouragement to pursue this project, and stylistic feedback during
the writing process. Without him, this book would have been written in half the
time but would not be nearly as useful.
Finally, thanks to Jim Waldo, who really started this whole thing many years ago
when he mailed a Linux box and The Art and Science of C to a young and impressionable
teenager.
PART I
Building Scrapers
This section focuses on the basic mechanics of web scraping: how to use Python to
request information from a web server, how to perform basic handling of the server’s
response, and how to begin interacting with a website in an automated fashion. By
the end, you’ll be cruising around the Internet with ease, building scrapers that can
hop from one domain to another, gather information, and store that information for
later use.
To be honest, web scraping is a fantastic field to get into if you want a huge payout for
relatively little upfront investment. In all likelihood, 90% of web scraping projects
you’ll encounter will draw on techniques used in just the next six chapters. This section
covers what the general (albeit technically savvy) public tends to think of when
they think of “web scrapers”:
Once you start web scraping, you start to appreciate all the little things that browsers
do for us. The Web, without a layer of HTML formatting, CSS styling, JavaScript execution,
and image rendering, can look a little intimidating at first, but in this chapter,
as well as the next one, we’ll cover how to format and interpret data without the help
of a browser.
This chapter will start with the basics of sending a GET request to a web server for a
specific page, reading the HTML output from that page, and doing some simple data
extraction in order to isolate the content that we are looking for.
Connecting
If you haven’t spent much time in networking, or network security, the mechanics of
the Internet might seem a little mysterious. We don’t want to think about what,
exactly, the network is doing every time we open a browser and go to http://google.com,
and, these days, we don’t have to. In fact, I would argue that it’s fantastic
that computer interfaces have advanced to the point where most people who use the
Internet don’t have the faintest idea about how it works.
However, web scraping requires stripping away some of this shroud of interface, not
just at the browser level (how it interprets all of this HTML, CSS, and JavaScript), but
occasionally at the level of the network connection.
To give you some idea of the infrastructure required to get information to your
browser, let’s use the following example. Alice owns a web server. Bob uses a desktop
computer, which is trying to connect to Alice’s server. When one machine wants to
talk to another machine, something like the following exchange takes place:
1. Bob’s computer sends along a stream of 1 and 0 bits, indicated by high and low
voltages on a wire. These bits form some information, containing a header and
body. The header contains an immediate destination of his local router’s MAC
address, with a final destination of Alice’s IP address. The body contains his
request for Alice’s server application.
2. Bob’s local router receives all these 1’s and 0’s and interprets them as a packet,
from Bob’s own MAC address, and destined for Alice’s IP address. His router
stamps its own IP address on the packet as the “from” IP address, and sends it off
across the Internet.
3. Bob’s packet traverses several intermediary servers, which direct his packet
toward the correct physical/wired path, on to Alice’s server.
4. Alice’s server receives the packet, at her IP address.
5. Alice’s server reads the packet’s destination port in the header and passes the
packet to the appropriate application: the web server application. (The port is
almost always port 80 for web applications. It can be thought of as something like
an “apartment number” for packet data, where the IP address is the “street address.”)
6. The web server application receives a stream of data from the server processor.
This data says something like:
- This is a GET request
- The following file is requested: index.html
7. The web server locates the correct HTML file, bundles it up into a new packet to
send to Bob, and sends it through to its local router, for transport back to Bob’s
machine, through the same process.
And voilà! We have The Internet.
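To make steps 5 through 7 more concrete, here is a minimal sketch that writes a GET request for index.html by hand and sends it over a plain socket, with no browser involved. So that it runs without touching the network, it starts a small local server to play the part of Alice's web server application. The names IndexHandler, raw_get, and demo, the page content, and the choice of 127.0.0.1 with an operating-system-assigned port are all assumptions made for this illustration. A real web server would normally listen on port 80, as described above.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class IndexHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for Alice's web server application."""

    def do_GET(self):
        # Step 7: locate the content and send it back to the requester
        body = b"<html><body><h1>Hello from Alice</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the demo quiet


def raw_get(host, port, path):
    """Send a hand-written GET request and return the raw response bytes."""
    # Step 6: "This is a GET request" / "The following file is requested"
    request = (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


def demo():
    """Start the local server, request /index.html, and return the response."""
    # Port 0 asks the operating system for any free port
    server = HTTPServer(("127.0.0.1", 0), IndexHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return raw_get("127.0.0.1", server.server_port, "/index.html")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo().decode("ascii"))
```

The response that comes back is a status line and headers followed by the HTML body, which is the raw material a browser (or a scraper) then interprets.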
So, where in this exchange did the web browser come into play? Absolutely nowhere.
In fact, browsers are a relatively recent invention in the history of the Internet, considering that
Nexus was released in 1990.
Yes, the web browser is a very useful application for creating these packets of information,
sending them off, and interpreting the data you get back as pretty pictures,
sounds, videos, and text. However, a web browser is just code, and code can be
taken apart, broken into its basic components, re-written, re-used, and made to do
anything we want. A web browser can tell the processor to send some data to the
application that handles your wireless (or wired) interface, but many languages have
libraries that can do that just as well.
Let’s take a look at how this is done in Python:
from urllib.request import urlopen

# Request the page, then print the raw HTML bytes of the response
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
You can save this code as scrapetest.py and run it in your terminal using the command: