
WEB SCRAPING

The A-Z of Web Scraping in 2020 [A How-To Guide]

Web scraping or data extraction in 2020 is the only way to get desired data if owners of a web site don't grant access to their users through API.

DMITRY NARIZHNYKH
7 JUN 2020 • 11 MIN READ

Many web sites like Twitter, YouTube, or Facebook provide an easy way to access their data through a public API. All the information that you obtain through an API is both well structured and normalized; for example, it can be in JSON, CSV, or XML format.

3 ways to extract data from any website.

Web Scraping vs API.

#1 Official API.
First of all, you should always check whether there's an official API that you can use to get the desired data.

Sometimes the official API is not updated accurately, or some of the data is missing from it.

#2 "Hidden API".
The backend might generate data in JSON or XML format that is consumed by the frontend.

Investigating XMLHttpRequest (XHR) calls with a web browser inspector gives us another way to access the data. It provides the data in the same way as an official API would.

How do we get this data? Let's hunt for the API endpoint!

For example, let's look at the https://covid-19.dataflowkit.com/ resource, which shows local COVID-19 cases to website visitors.

1. Call Chrome DevTools by pressing Ctrl+Shift+I.

2. Once the console appears, go to the "Network" tab.

3. Select the XHR filter to catch the API endpoint as an "XHR" request, if it is available.

4. Make sure the "recording" button is enabled.

5. Refresh the webpage.

6. Click Stop "recording" when you see that the data-related content has appeared on the webpage.

Learn how to find the API endpoint using Chrome DevTools

Now you can see a list of requests on the left. Investigate them. The preview
tab shows an array of values for the item named "v1."

Press the "Headers" tab to see details of the request. The most important thing for us is the URL. The request URL for "v1" is https://covid-19.dataflowkit.com/v1.
Now, let's just open that URL as another browser tab to see what happens.

Cool! That's what we're looking for.
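Once the endpoint is known, no browser is needed to fetch it. A minimal Python sketch; the `fetch_json` helper is our own, and the endpoint URL is the one discovered above, which may have changed since 2020:

```python
import json
import urllib.request

def fetch_json(url: str, timeout: float = 10.0):
    """Fetch a URL and decode its JSON body, sending a browser-like User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (hidden endpoint found via DevTools; may no longer be live):
# data = fetch_json("https://covid-19.dataflowkit.com/v1")
# print(data)
```

The same helper works for any "hidden" endpoint you discover in the Network tab.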

Taking data either directly from an API or using the technique described above is the easiest way to download datasets from websites. But what should you do if the owners of a web site don't grant their users access through an API?

Of course, these approaches are not going to work for every website, and that is why web scraping libraries are still necessary.

#3 Website scraping.

What is web scraping?

According to Wikipedia: "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites."

Web data extraction, or web scraping, is the only way to get the desired data if the owners of a web site don't grant their users access through an API. Web scraping is a data extraction technique that substitutes for manual repetitive typing or copy-pasting.

Photo by Adam Sherez / Unsplash



Know the rules!


What should you check before scraping a website?

☑ Robots.txt is the first thing to check when you plan to scrape website data. The robots.txt file lists the rules on how you or a bot should interact with the site. You should always respect and follow all the rules listed there.

☑ Make sure you also look at a site's Terms of Use. If the terms do not limit access by bots and spiders and do not prohibit rapid requests to the server, crawling is fine.

☑ To be compliant with the EU General Data Protection Regulation, or GDPR, you should first evaluate your web scraping project.

If you don't scrape personal data, then GDPR does not apply. In this case, you can skip this section and move to the next step.

☑ Be careful about how you use the extracted data, as you may sometimes violate copyrights. If the terms of use do not limit a particular use of the data, anything goes as long as the crawler does not violate copyright.

Find more information: Is web scraping legal or not?

Sitemaps
Typical websites have sitemap files containing a list of links belong to this web
site. They help to make it easier for search engines to crawl web sites and index
their pages. Getting URLs from sitemaps to crawl is always much faster than
gathering it sequentially with a web scraper.
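Extracting those URLs takes a few lines of standard-library Python. A sketch assuming the common `sitemap.xml` layout from the Sitemaps 0.9 schema; the sample XML here is invented:

```python
import xml.etree.ElementTree as ET

# A trimmed sitemap; real ones usually live at https://<site>/sitemap.xml.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products</loc></url>
</urlset>"""

# The sitemap schema puts every page URL inside <url><loc>…</loc></url>.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```

The resulting list of URLs can be fed straight into your crawl queue.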

Render JavaScript-driven web sites

JavaScript frameworks like Angular, React, and Vue.js are widely used for building modern web applications. In short, a typical web application frontend consists of HTML + JS code + CSS styles. Usually, the source HTML initially does not contain all the actual content. During a web page download, HTML DOM elements are loaded dynamically as the JavaScript code is executed. As a result, we get the rendered static HTML.

⚠ You can do web scraping with Selenium, but it is not a good idea. Many tutorials teach how to use Selenium for scraping data from websites, yet Selenium's home page clearly states that it is "for automating web applications for testing purposes."

🚫 PhantomJS was suitable for such tasks earlier, but its development has been suspended since 2018.

⚠ Alternatively, Scrapinghub's Splash was an option for Python programmers before Headless Chrome.

☑ Your browser is a website scraper by nature. The best way nowadays is to use Headless Chrome, as it renders web pages "natively."

The Puppeteer Node library is the best choice for JavaScript developers to control Chrome over the DevTools Protocol.

Go developers can choose either chromedp or cdp to access Chrome via the DevTools protocol.
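Even without a client library, Headless Chrome itself can print the rendered DOM from the command line. A hedged sketch in Python: it assumes a `google-chrome` binary on the PATH and uses Chrome's standard `--headless --dump-dom` flags; only the command builder runs unconditionally:

```python
import shutil
import subprocess

def render_command(url: str, binary: str = "google-chrome") -> list:
    """Build the headless-Chrome invocation that prints the rendered DOM to stdout."""
    return [binary, "--headless", "--disable-gpu", "--dump-dom", url]

cmd = render_command("https://example.com")
print(cmd)

# Only actually run it if a Chrome binary is installed on this machine:
if shutil.which(cmd[0]):
    html = subprocess.run(cmd, capture_output=True, text=True).stdout
```

For anything beyond a one-off dump (clicking, waiting for XHRs), a DevTools-protocol client like Puppeteer or chromedp is the better tool.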

Check out an online HTML scraper that renders JavaScript dynamic content in the cloud.

Be smart. Don't let them block you.

Photo by Randy Fath / Unsplash

Some web sites use anti-scraping techniques to prevent web scraper tools from harvesting online data. Web scraping is always a "cat and mouse" game. So when building a web scraper, consider the following ways to avoid getting blocked; otherwise, you risk not receiving the desired results.

Tip #1: Make random delays between requests.


When a human visits a web site, the speed of accessing different pages is many times lower than a web crawler's. A web scraper, on the contrary, can extract several pages simultaneously in no time. Huge traffic coming to the site in a short period of time looks suspicious.

You should find out the ideal crawling speed, which is individual for each website. To mimic human user behavior, you can add random delays between requests.

Don't create excessive load for the site. Be polite to the site that you extract
data from so that you can keep scraping it without getting blocked.
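The idea can be sketched in a few lines; `random_delay` is a hypothetical helper, and the one-to-five-second range is an arbitrary choice you should tune per site:

```python
import random
import time

def random_delay(min_s: float = 1.0, max_s: float = 5.0) -> float:
    """Sleep for a random interval and return it, to pace consecutive requests."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# for url in urls_to_scrape:
#     fetch(url)            # your HTTP client of choice
#     random_delay(1, 5)    # look less like a burst of machine traffic
```

Jittered delays break the perfectly regular request rhythm that makes a crawler easy to spot.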

Tip #2: Change User-agents.


When a browser connects to a web site, it passes the User-Agent (UA) string in the HTTP header. This field identifies the browser, its version number, and the host operating system.

A typical user agent string looks like this: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36".

If multiple requests to the same domain carry the same user-agent, the web site can detect and block you very soon.

Some websites block specific requests if they contain a User-Agent that differs from a common browser's.

If the "user-agent" value is missing, many websites won't allow access to their content.

What is the solution?

You have to build a list of user-agents and rotate them randomly.
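A minimal rotation sketch; the pool below is just three example strings, and in practice you would maintain a much larger, up-to-date list:

```python
import random

# A small, illustrative pool; keep a larger, current list in production.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15",
]

def random_headers() -> dict:
    """Pick a User-Agent at random for the next request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

print(random_headers())
```

Pass the returned headers to whatever HTTP client you use, so consecutive requests no longer share one fingerprint.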

Tip #3: Rotate IP addresses. Use Proxy servers.


If you send multiple requests from the same IP address while scraping, the website considers it suspicious behavior and blocks you.

For the most straightforward cases, it is enough to use the cheapest datacenter proxies. But some websites have advanced bot detection algorithms, so you have to use either residential or mobile proxies to scrape them.

For example, someone in Europe wants to extract data from a website whose access is limited to US users only. The obvious solution is to make requests through a proxy server located in the USA, since the traffic then seems to come from a local US IP address.

To obtain country-specific versions of target websites, just specify an arbitrary country in the request parameters of the Dataflow Kit HTML scraping service.

Specify the country from which to send requests to the target website.
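With plain `urllib`, routing requests through proxies is a matter of building one opener per proxy and rotating among them. The proxy hostnames below are placeholders:

```python
import urllib.request

def opener_via_proxy(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes HTTP(S) traffic through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

# One opener per proxy; rotate between them request by request.
proxies = ["http://us-proxy-1.example.com:8080", "http://us-proxy-2.example.com:8080"]
openers = [opener_via_proxy(p) for p in proxies]
# html = openers[i % len(openers)].open("https://example.com").read()
```

Combining proxy rotation with the user-agent rotation above makes consecutive requests look like traffic from different visitors.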

Tip #4: Avoid scraping patterns. Imitate human behavior.

Humans are not consistent while navigating a website. They perform random actions like clicks on the page and mouse movements.

In contrast, web scraping bots follow specified patterns when crawling a web site.

Teach your scraper to imitate human behavior. This way, website bot detection algorithms have no reason to block you from automating your scraping tasks.

Tip #5: Keep an eye on anti-scraping tools.

One of the most frequently used tools for detecting hacking or web scraping attempts is the "honey pot." Honey pots are not visible to the human eye but can be seen by bots or web scrapers. Right after your scraper clicks such a hidden link, the site blocks you quite easily.

Find out whether a link has the "display: none" or "visibility: hidden" CSS property set; if it does, just stop following that link. Otherwise, the site immediately identifies you as a bot or scraper, fingerprints the properties of your requests, and bans you.
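A sketch of such a filter using the stdlib `html.parser`. It only inspects inline `style` attributes, so it would miss honeypots hidden via external stylesheets; a real crawler would need computed styles for that:

```python
from html.parser import HTMLParser

class VisibleLinkCollector(HTMLParser):
    """Collect hrefs of <a> tags, skipping likely honeypots hidden via inline CSS."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            return  # likely a honeypot trap: do not follow
        if "href" in attrs:
            self.links.append(attrs["href"])

parser = VisibleLinkCollector()
parser.feed('<a href="/ok">ok</a><a href="/trap" style="display: none">x</a>')
print(parser.links)
```

Only the visible `/ok` link survives; the hidden `/trap` link is silently dropped from the crawl queue.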

Tip #6: Solve online CAPTCHAs.


While scraping a website on a large scale, there is a chance of being blocked by it. Then you start seeing captcha pages instead of web pages.

CAPTCHA is a test used by websites to battle back against bots and crawlers,
asking website visitors to prove they're human before proceeding.

Many websites use reCAPTCHA from Google. Its v2 asks visitors to tick the "I'm not a robot" box, while the latest version, v3, analyzes human behavior in the background without an explicit challenge.

CAPTCHA solving services use two methods for solving CAPTCHAs:

☑ Human-based CAPTCHA Solving Services

When you send your CAPTCHA to such a service, human workers solve the CAPTCHA and send the answer back.

☑ OCR (Optical Character Recognition) Solutions

In this case, OCR technology is used to solve CAPTCHAs automatically.

Point-and-click visual selector.

Of course, we don't intend only to download and render JavaScript-driven web pages, but to extract structured data from them.

Before starting data extraction, let's specify the patterns of data. Look at the sample screenshot taken from a web store selling smartphones. We want to scrape the Image, the Title of an item, and its Price.

Specify patterns of data to choose similar elements on a web page.

The Google Chrome Inspect tool does a great job of investigating the DOM structure of HTML web pages.

Click the Inspect icon in the top-left corner of DevTools.

Chrome Inspector tool

With the Chrome Inspect tool, you can easily find and copy either CSS
Selector or XPath of specified DOM elements on the web page.
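Real scrapers usually feed such selectors to BeautifulSoup, lxml, or a headless browser. As a dependency-free illustration, here is a tiny stdlib-only extractor that mimics a class selector; the sample markup and class names are invented:

```python
from html.parser import HTMLParser

class ByClassExtractor(HTMLParser):
    """Collect the text content of leaf tags that carry a given class."""
    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.capturing = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.cls in dict(attrs).get("class", "").split():
            self.capturing = True
            self.results.append("")

    def handle_data(self, data):
        if self.capturing:
            self.results[-1] += data

    def handle_endtag(self, tag):
        self.capturing = False

page = '<div class="item"><h2 class="title">Phone X</h2><span class="price">$490</span></div>'
ex = ByClassExtractor("price")
ex.feed(page)
print(ex.results)   # ['$490']
```

Running the same extractor with the class name "title" yields the item title instead; in a real project a proper selector engine handles nesting and multiple matches for you.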

Usually, when scraping a web page, you have more than one similar block of
data to extract. Often you crawl several pages during one scraping session.

Surely, you can use the Chrome Inspector to build a payload for scraping. In some complex cases, it is the only way to investigate particular element properties on a web page.

Though modern online web scrapers, in most cases, offer a more comfortable way to specify patterns (CSS Selectors or XPath) for data scraping, set up pagination rules, and define rules for processing detail pages along the way.
How do you crawl several million pages and extract tens of millions of records?

What should you do if the size of the output data is moderate to huge?

Choose the right format for output data.

Photo by Ricardo Gomez Angel / Unsplash

Format #1. Comma Separated Values (CSV) format

CSV is the simplest human-readable data exchange format. Each line of the file is a data record, and each record consists of an identical list of fields separated by commas.

Here is a list of families represented as CSV data:



id,father,mother,children
1,Mark,Charlotte,1
2,John,Ann,3
3,Bob,Monika,2

CSV is limited to storing two-dimensional, untyped data. There is no way to specify nested structures or value types, like the names of the children, in plain CSV.
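Reading that sample back with Python's csv module shows how little code flat records need; note that every value comes back as a string, since CSV carries no type information:

```python
import csv
import io

data = """id,father,mother,children
1,Mark,Charlotte,1
2,John,Ann,3
3,Bob,Monika,2
"""

# DictReader maps the header row onto each record.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["father"])                         # Mark
print(sum(int(r["children"]) for r in rows))     # 6
```

The explicit `int(...)` conversion is exactly the untyped-data limitation described above.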

Format #2. JSON

[
  {
    "id": 1,
    "father": "Mark",
    "mother": "Charlotte",
    "children": ["Tom"]
  },
  {
    "id": 2,
    "father": "John",
    "mother": "Ann",
    "children": ["Jessika", "Antony", "Jack"]
  },
  {
    "id": 3,
    "father": "Bob",
    "mother": "Monika",
    "children": ["Jerry", "Karol"]
  }
]

Representing nested structures in JSON files is easy, however.

Nowadays, JavaScript Object Notation (JSON) has become the de facto standard data exchange format, replacing XML in most cases.

One of our projects consists of 3 million parsed pages; as a result, the size of the final JSON is more than 700 MB.

The problem arises when you have to deal with JSON files of such size. To insert or read a record from a JSON array, you need to parse the whole file every time, which is far from ideal.

Format #3. JSON Lines


Let's look into what JSON Lines format is, and how it compares to traditional
JSON. It is already common in the industry to use JSON Lines. Logstash and
Docker store logs as JSON Lines.

The same list of families expressed in the JSON Lines format looks like this:

{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}

JSON Lines consists of several lines in which each line is a valid JSON object, separated by the newline character \n.
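Parsing such a file is then a one-liner per record; each line is decoded independently, so nothing forces you to hold the whole file in memory:

```python
import io
import json

jsonl = (
    '{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}\n'
    '{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}\n'
    '{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}\n'
)

# Stream one record at a time; with a real file, iterate over open(path).
records = [json.loads(line) for line in io.StringIO(jsonl)]
print(records[1]["children"])   # ['Jessika', 'Antony', 'Jack']
```

Appending a new record is just writing one more line, with no need to re-serialize an enclosing array.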

Look at this video to find out how it works.

Build rules for data extraction with "Point-and-Click" data selector. 

— Try to build a web scraper yourself! —

Manage your Data Storage strategy.


The most well-known simple data formats for storing structured data nowadays include CSV, Excel, and JSON (Lines). Extracted data may be encoded into the destination format right after parsing a web page. These formats are suitable as storage for low-sized volumes of data.

Crawling a few pages may be easy, but millions of pages require different approaches.
Since every entry in JSON Lines is a valid JSON, you can parse every line as a standalone JSON document. For example, you can seek within the file and split a 10 GB file into smaller files without parsing the entire thing. You can read as many lines as needed to get the same amount of records.

Summary

A good scraping platform should:

☑ Fetch and extract data from web pages concurrently.
We use the concurrency features of Golang and have found them fantastic;
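The article's platform uses Go goroutines; the same fan-out idea looks like this in Python with a thread pool, where `scrape` is a stand-in for a real fetch-and-parse step:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape(url: str) -> str:
    """Placeholder for a real fetch-and-parse step."""
    return f"parsed:{url}"

urls = [f"https://example.com/page/{i}" for i in range(5)]

# A pool of workers processes several pages at once, like goroutines in Go.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(scrape, urls))

print(results)
```

Threads work well here because scraping is I/O-bound; the pool size is the knob that keeps the target site's load polite.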

☑ Persist extracted blocks of scraped data in a central database regularly.
This way, you don't have to keep much data in RAM while scraping many pages. Besides, it is easy to export the data to different formats several times later. We use MongoDB as our central storage.

☑ Be web-based.
An online website scraper is accessible anywhere from any device that can connect to the internet. Different operating systems aren't an issue anymore; it's all about the browser.

☑ Be cloud-friendly.
It should provide a way to quickly scale cloud capacity up or down according to the current requirements of a web data extraction project.
Conclusion

In this post, I tried to explain how to scrape web pages in 2020. Before considering scraping, try to find out whether an official API exists, or hunt for "hidden" API endpoints.

I would appreciate it if you could take a minute to tell me which one of the web
scraping methods you use the most in 2020. Just leave me a comment below.

Happy scraping!
