The A-Z of Web Scraping in 2020 (A How-To Guide)
WEB SCRAPING
DMITRY NARIZHNYKH
7 JUN 2020 • 11 MIN READ
Many websites like Twitter, YouTube, or Facebook provide an easy way to
access their data through a public API. All the information you obtain
through an API is both well structured and normalized. For example, it can be
in the format of JSON, CSV, or XML.
#1 Official API.
First of all, you should always check out if there's an official API that you can
use to get the desired data.
Sometimes the official API is not kept up to date, or some of the data you
need is missing from it.
#2 "Hidden API".
The backend might generate data in JSON or XML format that is consumed by the
frontend. You can often catch such "hidden" API endpoints with the Chrome
DevTools Network panel: open it, select the XHR filter to catch an API
endpoint as an "XHR" request if it is available, reload the page, and click
Stop "recording" once the data-related content has appeared on the webpage.
Now you can see a list of requests on the left. Investigate them. The preview
tab shows an array of values for the item named "v1."
Press the "Headers" tab to see details of the request. The most important thing
for us is the URL. The request URL for "v1" is https://covid-19.dataflowkit.com/v1.
Now, let's just open that URL as another browser tab to see what happens.
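If the endpoint returns plain JSON, you can also query it programmatically instead of scraping the rendered page. Here is a minimal sketch using Node's built-in fetch; it assumes Node 18+ and that the endpoint returns a JSON array, as the Preview tab suggested:

// Minimal sketch: query the "hidden" API endpoint found via DevTools.
// Assumes Node 18+ (built-in fetch) and a JSON-array response.
async function fetchCovidStats(): Promise<void> {
  const response = await fetch("https://covid-19.dataflowkit.com/v1");
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const records: unknown[] = await response.json();
  console.log(`Fetched ${records.length} records`);
  console.log(records[0]); // inspect the shape of the first record
}

fetchCovidStats().catch(console.error);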
Taking data either directly from an API or using the technique described above
is the easiest way to get it. Of course, these approaches are not going to work
for all websites, and that is why web scraping libraries are still necessary.
#3 Website scraping.
Web data extraction, or web scraping, is the only way to get the desired data
if the owners of a website don't grant their users access through an API. Web
scraping is the data extraction technique that substitutes for manual,
repetitive typing or copy-pasting.
☑ Robots.txt is the first thing to check when you plan to scrape website data.
The robots.txt file lists the rules on how you or a bot should interact with
the site. You should always respect and follow all the rules listed in robots.txt.
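As a sketch of how to honor those rules programmatically, here is a check built on the robots-parser npm package; the bot name and URLs below are placeholders:

// Minimal sketch: consult robots.txt before crawling, via the
// "robots-parser" npm package. Bot name and URLs are placeholders.
import robotsParser from "robots-parser";

async function canCrawl(url: string, userAgent: string): Promise<boolean> {
  const robotsUrl = new URL("/robots.txt", url).href;
  const contents = await (await fetch(robotsUrl)).text();
  const robots = robotsParser(robotsUrl, contents);
  // isAllowed() returns undefined when no rule applies; treat that as allowed.
  return robots.isAllowed(url, userAgent) ?? true;
}

canCrawl("https://example.com/products", "MyScraperBot")
  .then((ok) => console.log(ok ? "allowed" : "disallowed"));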
☑ Make sure you also look at a site's Terms of Use. If the terms do not limit
access by bots and spiders and do not prohibit rapid requests to the server,
crawling is fine.
If you don't scrape personal data, then GDPR does not apply. In this case, you
can skip this section and move to the next step.
☑ Be careful about how you use the extracted data, as you may sometimes
violate copyright. If the terms of use do not limit a particular use of the
data, anything goes so long as the crawler does not violate copyright.
Sitemaps
Typical websites have sitemap files containing a list of links belonging to the
website. They make it easier for search engines to crawl websites and index
their pages. Getting URLs to crawl from sitemaps is always much faster than
gathering them sequentially with a web scraper.
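A rough sketch of that shortcut, assuming a standard sitemaps.org XML file at a placeholder URL:

// Minimal sketch: pull crawlable URLs from a sitemap instead of
// discovering them page by page. The sitemap URL is a placeholder.
async function urlsFromSitemap(sitemapUrl: string): Promise<string[]> {
  const xml = await (await fetch(sitemapUrl)).text();
  // Per the sitemaps.org protocol, each URL lives in a <loc> element.
  const tags = xml.match(/<loc>(.*?)<\/loc>/g) ?? [];
  return tags.map((tag) => tag.replace(/<\/?loc>/g, "").trim());
}

urlsFromSitemap("https://example.com/sitemap.xml")
  .then((urls) => console.log(`${urls.length} URLs to crawl`))
  .catch(console.error);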
⚠ You can do web scraping with Selenium, but it is not a good idea. Many
tutorials teach how to use Selenium for scraping data from websites, yet
Selenium's home page clearly states that it is "for automating web
applications for testing purposes."
🚫 PhantomJS was suitable for such tasks earlier, but its development has been
suspended since 2018.
☑ Your browser is a website scraper by nature. The best way nowadays is
to use Headless Chrome, as it renders web pages "natively."
The Puppeteer Node library is the best choice for JavaScript developers who
want to control Chrome over the DevTools Protocol.
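A minimal Puppeteer sketch of that approach; the target URL is a placeholder:

// Minimal sketch: render a JavaScript-heavy page in headless Chrome
// with Puppeteer and grab the resulting HTML.
import puppeteer from "puppeteer";

async function renderPage(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so dynamic content has loaded.
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.content(); // fully rendered HTML
  } finally {
    await browser.close();
  }
}

renderPage("https://example.com")
  .then((html) => console.log(html.slice(0, 200)))
  .catch(console.error);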
Check out an online HTML scraper that renders JavaScript dynamic content in
the cloud.
Some websites use anti-scraping techniques to prevent web scraper tools
from harvesting online data. Web scraping is always a "cat and mouse" game.
So when building a web scraper, consider the following ways to avoid getting
blocked; otherwise, you risk not receiving the desired results.
You should find out the ideal crawling speed, which is individual for each
website. To mimic human user behavior, you can add random delays between
requests.
Don't create excessive load for the site. Be polite to the site that you extract
data from so that you can keep scraping it without getting blocked.
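A sketch of such randomized throttling; the delay bounds are illustrative, so tune them per site:

// Minimal sketch: crawl politely with a random pause between requests.
function randomDelay(minMs: number, maxMs: number): Promise<void> {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function politeCrawl(urls: string[]): Promise<void> {
  for (const url of urls) {
    const html = await (await fetch(url)).text();
    console.log(`${url}: ${html.length} bytes`);
    await randomDelay(2000, 7000); // pause 2-7 seconds, like a human reader
  }
}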
☑ Set a browser-like User-Agent header for your requests. A typical user-agent
string looks like this: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36.
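Sending such a header is straightforward with fetch; a sketch, with a placeholder target URL:

// Minimal sketch: send a browser-like User-Agent with each request.
// Rotating among several real UA strings is a common extension.
const BROWSER_UA =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36";

async function fetchAsBrowser(url: string): Promise<string> {
  const response = await fetch(url, {
    headers: { "User-Agent": BROWSER_UA },
  });
  return response.text();
}

fetchAsBrowser("https://example.com").then((html) => console.log(html.length));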
For the most straightforward cases, it is enough to use the cheapest datacenter
proxies. But some websites have advanced bot-detection algorithms, so you
have to use either residential or mobile proxies to scrape them.
For example, someone in Europe wants to extract data from a website that
limits access to US users only. The obvious solution is to make requests
through a proxy server located in the USA, since the traffic then appears to
come from a local US IP address.
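A sketch of routing requests through such a proxy with undici's ProxyAgent; the proxy URL is a placeholder for whatever your provider gives you:

// Minimal sketch: send requests through a US-based proxy using the
// "undici" package. The proxy credentials and URL are placeholders.
import { fetch, ProxyAgent } from "undici";

const usProxy = new ProxyAgent("http://user:pass@us-proxy.example.com:8080");

async function fetchViaUsProxy(url: string): Promise<number> {
  const response = await fetch(url, { dispatcher: usProxy });
  return response.status;
}

fetchViaUsProxy("https://example.com").then(console.log);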
In contrast to human users, web scraping bots follow specified patterns when
crawling a website.
Teach your scraper to imitate human behavior. This way, website bot-detection
algorithms have no reason to block you from automating your scraping tasks.
Find out whether a link has the display: none or visibility: hidden CSS
property set; if it does, just stop following that link. Otherwise, a site
immediately identifies you as a bot or scraper, fingerprints the properties of
your requests, and bans you.
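A sketch of that check inside a Puppeteer page, reusing a page object like the one from the earlier sketch:

// Minimal sketch: collect only visible links from a Puppeteer page,
// skipping honeypots hidden with display:none or visibility:hidden.
import type { Page } from "puppeteer";

async function visibleLinks(page: Page): Promise<string[]> {
  return page.$$eval("a[href]", (anchors) =>
    anchors
      .filter((a) => {
        const style = window.getComputedStyle(a);
        return style.display !== "none" && style.visibility !== "hidden";
      })
      .map((a) => (a as HTMLAnchorElement).href)
  );
}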
CAPTCHA is a test used by websites to battle back against bots and crawlers,
asking website visitors to prove they're human before proceeding.
Before starting data extraction, let's specify the patterns of the data. Look
at the sample screenshot, taken from a web store selling smartphones. We want
to scrape the Image, the Title of an item, and its Price.
The Google Chrome Inspect tool does a great job of investigating the DOM
structure of HTML web pages. With the Chrome Inspect tool, you can easily find
and copy either the CSS Selector or the XPath of specified DOM elements on the
web page.
Usually, when scraping a web page, you have more than one similar block of
data to extract, and you often crawl several pages during one scraping
session. Surely, you can use the Chrome Inspector to build a payload for
scraping. In some complex cases, it is the only way to investigate particular
element properties on a web page.
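As a sketch of what you do with the copied selectors, here is an extraction pass built on the cheerio npm package; the .product-card, .title, and .price selectors are hypothetical, so copy real ones from the Inspect tool:

// Minimal sketch: extract Image, Title, and Price with CSS selectors
// using "cheerio". The selectors below are hypothetical placeholders.
import * as cheerio from "cheerio";

interface Product {
  image?: string;
  title: string;
  price: string;
}

function parseProducts(html: string): Product[] {
  const $ = cheerio.load(html);
  return $(".product-card")
    .map((_, el) => ({
      image: $(el).find("img").attr("src"),
      title: $(el).find(".title").text().trim(),
      price: $(el).find(".price").text().trim(),
    }))
    .get();
}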
Though modern online web scrapers, in most cases, offer a more comfortable
way to specify patterns (CSS Selectors or XPath) for data scraping, set up
pagination rules, and define rules for processing detail pages along the way.
How do you crawl several million pages and extract tens of millions of records?
CSV is the simplest human-readable data exchange format. Each line of the
file is a data record, and each record consists of an identical list of fields
separated by commas.
id,father,mother,children
1,Mark,Charlotte,1
2,John,Ann,3
3,Bob,Monika,2
The same list of families expressed as JSON looks like this:
[
  {
    "id": 1,
    "father": "Mark",
    "mother": "Charlotte",
    "children": ["Tom"]
  },
  {
    "id": 2,
    "father": "John",
    "mother": "Ann",
    "children": ["Jessika", "Antony", "Jack"]
  },
  {
    "id": 3,
    "father": "Bob",
    "mother": "Monika",
    "children": ["Jerry", "Karol"]
  }
]
One of our projects consists of 3 million parsed pages. As a result, the final
JSON is more than 700 MB.
The problem arises when you have to deal with JSON files of that size. To
insert or read a record from a JSON array, you need to parse the whole file
every time, which is far from ideal.
The same list of families expressed in the JSON Lines format looks like this:
{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}
JSON Lines consists of several lines in which each line is a valid JSON
object, separated by the newline character \n.
Of course, you may use CSV or JSON as a destination format right after parsing
a web page. These formats are suitable as storage for small volumes of data.
Crawling a few pages may be easy, but millions of pages require different
approaches.
Since every entry in JSON Lines is a valid JSON, you can parse every line as a
standalone JSON document. For example, you can seek within the file or split a
10 GB file into smaller files without parsing the entire thing. You can read
as many lines as needed to get the same amount of records.
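A sketch of that streaming access using only Node built-ins; the families.jsonl filename is a placeholder:

// Minimal sketch: stream a JSON Lines file record by record, so the
// whole 700 MB file never has to sit in memory at once.
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

async function* readJsonLines(path: string): AsyncGenerator<unknown> {
  const lines = createInterface({ input: createReadStream(path) });
  for await (const line of lines) {
    if (line.trim()) yield JSON.parse(line); // each line is standalone JSON
  }
}

async function main(): Promise<void> {
  let count = 0;
  for await (const _record of readJsonLines("families.jsonl")) count++;
  console.log(`${count} records`);
}

main().catch(console.error);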
Summary
☑ Be web-based.
An online website scraper is accessible anywhere, from any device that can
connect to the internet. Different operating systems aren't an issue anymore.
It's all about the browser.
☑ Be cloud-friendly.
It should provide a way to quickly scale up or down cloud capacity according
to the current requirement of a web data extraction project.
Conclusion
In this post, I tried to explain how to scrape web pages in 2020. But before
considering scraping, try to find out whether an official API exists, or hunt
for some "hidden" API endpoints.
I would appreciate it if you could take a minute to tell me which one of the web
scraping methods you use the most in 2020. Just leave me a comment below.
Happy scraping!
➡ Check if they have the official API?
➡ Hunt for "Hidden" API endpoints?