06 WebScrapingData
Dr Ali Anaissi
School of Computer Science
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 1
Learning Objectives This Week
– Web Scraping basics
– Python Libraries for Web Scraping
– Web Standards / Intro Semi-structured Data
– HTML
– CSS selectors
– Data Cleaning
– Storing and querying semi-structured data in PostgreSQL
Web Scraping
Getting Data
– Existing files: Excel Sheets, CSV, …
– Databases
– Querying existing databases with SQL
– Scraping the Web
– Web crawling + HTML parsing
– Programming APIs
– 'querying' web service APIs
– more details on following slides
Motivating Example: How can we extract this data?
Scraping the Web
– Web pages are written in HTML, which is a semi-structured
data format with some similarity to XML
<html>
<head>
<title>Data Science, Big Data and Data Diversity</title>
</head>
<body>
<h1><span class="uoscode">DATA2001</span> - Data Science and Big Data</h1>
<div class="lecturer">Uwe Roehm</div>
<p id="4711" class="description">DATA2001 is about …</p>
</body>
</html>
Overview Web Requests
[Diagram: a web browser sends an HTTP request to a web server and receives HTML back; the server either serves static pages or generates dynamic web pages by reading files.]
Scraping the Web (cont’d)
– There are several support libraries for Python to scrape the web
– HTML crawling: Requests library (import requests)
http://docs.python-requests.org/en/master/
– HTML parsing: BeautifulSoup4 (with the html5lib parser)
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
– Example Code:
import requests
from bs4 import BeautifulSoup
html = requests.get("http://www.example.com").text
soup = BeautifulSoup(html, 'html5lib')
Some Tips from our TA Harshana and former Tutor Chris
– URL format is incredibly important to take note of
– Look at any parameters (e.g. flight data from the Airport OnTime dataset:
http://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2021_1.zip
– the year and month are parameterised and can be looped over)
– Find any patterns with links when accessing data
(i.e. do they do it monthly, yearly, bi-weekly etc.)
– Access tokens (e.g. do they require passing an API key?)
– Web page structure is useful to note.
– Use the page inspector to narrow down what you're looking for.
– Complex tokenising can get messy (we might have to tokenise child nodes of the
elements)
Robots Exclusion Standard
– Many websites provide a robots.txt file
– Web crawlers are expected to check this file before they start crawling a website
– It lists the crawling rules for the site. Example: https://en.wikipedia.org/robots.txt
Data Cleaning
– This is a topic on its own…
– Data from websites is rarely in a clean format
– neither in its formatting
– nor in its content
– Rules of thumb:
– Expect things to be different from what they are supposed to be (e.g. “ , ; \t)
– Clean data before further processing or storage
• E.g. empty cells, placeholders, special characters, or excess spaces
• Do it programmatically so that you can re-use your solution
– Cross-check data consistency once loaded; eg. spelling variants of same entity?
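The rules of thumb above can be captured in a small, reusable cleaning step; the placeholder tokens ("n/a", "-") are assumptions for illustration:

```python
# Drop empty cells and placeholders, strip excess whitespace;
# the placeholder set is an assumption for this example.
PLACEHOLDERS = {"", "-", "n/a"}

def clean(values):
    cleaned = []
    for v in values:
        v = v.strip()                  # remove excess spaces and tabs
        if v.lower() in PLACEHOLDERS:  # skip empty cells and placeholders
            continue
        cleaned.append(v)
    return cleaned

raw = ["  Sydney ", "n/a", "Melbourne\t", "", "-", "Brisbane"]
print(clean(raw))  # ['Sydney', 'Melbourne', 'Brisbane']
```

Doing this programmatically (rather than by hand) means the same cleaning can be re-run on every new crawl.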
Extracting Data from HTML
Web Retrieval in Python (Single Pages)
– Two forms of requests: GET (optionally with parameters) or POST (with params)
– Making Simple Web Requests in Python
– Python requests library:
standard webpage: requests.get(URL)
webpage with parameters: requests.get(URL, params=dict(key=value,…))
web form (POST request): requests.post(URL, data=dict(key=value,…))
– Example:
import requests
from bs4 import BeautifulSoup
response = requests.get("http://www.example.com")
print(response.status_code) # inspect response code of server
content = BeautifulSoup(response.text, 'html5lib')
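The params= argument above is encoded into the URL's query string. A stdlib sketch of that same encoding step (URL and keys are made up):

```python
from urllib.parse import urlencode

# How requests.get(URL, params=...) turns its parameters into a query string
params = {"q": "convict ships", "page": 2}
url = "http://www.example.com/search?" + urlencode(params)
print(url)  # http://www.example.com/search?q=convict+ships&page=2
```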
Web Page Retrieval: URLs
– URL – Uniform Resource Locator
– “address” format on the web
– Example:
• https://www.health.nsw.gov.au/news/Pages/20220329_00.aspx
– General Format
• protocol://site/path_to_resource
• Typical protocols: http https ftp
Web Site Crawling in Python (multiple page scraping)
– Scrapy
– Extensive Python framework to implement a web ‘spider’ – a program that follows multiple links
along the web
– You can extend its spider class with your own functionality to extract parts of the visited pages while
the spider follows further links
– https://docs.scrapy.org/en/latest/intro/overview.html
– Selenium
– A programmable web browser with a Python binding that lets you send requests as if a user
had clicked on links or used the page (including running embedded JavaScript)
– Typically used for automatic testing of websites
– But can also be used for ‘crawling’ a complex interactive website
– https://selenium-python.readthedocs.io
HTML – Hypertext Markup Language
– Webpages are written in HTML
– Textual markup language that defines structure, content, and design
of a page as well as active elements (scripts, forms, etc.)
– Typically several additional files linked:
• CSS - cascading style sheets
• Scripts, Images, videos etc.
– Markup via opening & closing tags (e.g. <title>…</title>)
– Pre-defined in HTML standards (http://www.w3.org)
– Interpreted by web browsers for display
– However, HTML is designed for display, not for data exchange between programs
– How can we extract data with our own programs?
HTML Example
<!DOCTYPE html>
<html>
<head>
<title>Literature List…</title>
</head>
<body>
<h1>References</h1>
<p>The following are some interesting links on web scraping:</p>
<div id="biblist">
<ul>
<li> "Data Science From Scratch", Chapter 23 </li>
<li> <a href="http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/">Web Scraping for Data Journalists</a> </li>
…
</ul>
</div> …
</body>
</html>
General Structure of a Web Page
– Head
– title, style sheets, scripts, meta-data
– Body
– headings, text, lists, tables, images, forms etc.
How to Select Content in a Webpage?
– Four options:
– text patterns
• simple, but not great for complex patterns, since we have to rely on our own
parsing…
– DOM navigation
• Document object model
– CSS selectors
• based on tag types, class specifications, and IDs of elements
• easy to specify, but depends on CSS classes and IDs being well used
– XPath expressions => Week 7
• a powerful language for navigating along the document tree
and selecting all nodes or even sub-trees that match the path expression
• can contain filter predicates, e.g. on values of XML/HTML attributes
Content Extraction with BeautifulSoup
Example for DOM-based navigation and data extraction:
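A minimal sketch of such DOM-based navigation, reusing the course-page snippet from the earlier slide (the stdlib 'html.parser' is used here so no extra parser install is needed; html5lib works as well):

```python
from bs4 import BeautifulSoup

# Sample page from the earlier HTML slide
html = """
<html><head><title>Data Science, Big Data and Data Diversity</title></head>
<body>
  <h1><span class="uoscode">DATA2001</span> - Data Science and Big Data</h1>
  <div class="lecturer">Uwe Roehm</div>
  <p id="4711" class="description">DATA2001 is about ...</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.head.title.text)       # dot notation follows a path along the DOM
print(soup.body.h1.span.text)     # DATA2001
print(soup.find(id="4711").text)  # find() locates an element anywhere by id
```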
Content Extraction into a Pandas DataFrame
– Example:
import pandas as pd
import requests
from bs4 import BeautifulSoup
response = requests.get("http://www.example.com")
print(response.status_code) # inspect response code of server
content = BeautifulSoup(response.text, 'html5lib')
table = content.find_all('table')[0]
df = pd.read_html(str(table))[0] # only works with HTML tables
countries = df["COUNTRY"].tolist()
print(countries)
# pretty print
from tabulate import tabulate
print( tabulate(df, headers="keys", tablefmt="psql") )
HTML Document Model (DOM): Element-Tree
[Diagram: the DOM element tree of a sample page. The root (document element) is html, with children head and body. head contains title ("Albion Convict Ship 1823"), meta (with attribute content="text/html"), and script (JavaScript); body contains h1 ("Albion Voyage 1823"), p ("Sailed to Van Dieman's Land in 1823."), and a div with id="results" containing a table with class="data", whose tr rows hold th/td cells (e.g. "Convict", "William"). BeautifulSoup's dot notation (cf. previous slide) follows a path along this DOM tree; "title" is contained in "head". Sibling tags have an order from left to right!]
CSS Selectors
– HTML elements can have multiple CSS class attributes associated with them for display,
as well as an ID attribute for identification
– Example: <table class="data" id="42">
– CSS Selectors (the most important ones):
– Selecting an element e with a specific class: e.class
• E.g. <table class="data"> => table.data
– Selecting an element e by ID: e#id
• E.g. <div id="results"> => div#results or just #results
– Selecting by position within a parent element
• e:first-child e:last-child e:nth-child(n)
– Selecting instances out of multiple occurrences
• e:nth-of-type(n)
– …
Using CSS Selectors with BeautifulSoup
– BeautifulSoup provides several search functions:
find() find_all() select() (only select() accepts full CSS selectors)
– Examples: (assuming page_content is a parsed webpage)
– Finding an element by type:
elements = page_content.find_all("h3")
for e in elements: …
– Finding an element by id:
element = page_content.find(id="ship")
– Finding table elements with a specific CSS class:
element = page_content.find_all("table", "data")
– Look for tags matching a general (complex) CSS selector:
elements = page_content.select("#ship .data")
for e in elements: …
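The calls above, run against a small made-up page so the selectors have something to match:

```python
from bs4 import BeautifulSoup

# Tiny sample page (the ship/table content is invented for illustration)
html = """
<div id="ship">
  <h3>Albion</h3>
  <table class="data"><tr><td>1823</td></tr></table>
</div>
"""
page_content = BeautifulSoup(html, "html.parser")

print(page_content.find_all("h3")[0].text)           # Albion
print(page_content.find(id="ship").name)             # div
print(len(page_content.find_all("table", "data")))   # 1
print(page_content.select("#ship .data")[0].name)    # table
```

Note how select("#ship .data") combines an id selector with a descendant class selector in a single expression, which find()/find_all() cannot do.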
Scraping HTML with Pandas
import pandas as pd
# pandas can also read directly from a URL – but only tables!
dfs = pd.read_html('https://www.health.nsw.gov.au/infectious/diseases/Pages/covid-19-latest.aspx')
# scrapes tables as a list(!) of DataFrames
dfs[2].tail()
Storing Extracted Data?
– Crawling web pages takes some time, so it is a good idea to
store the data locally once extracted
– This avoids re-crawling the remote servers for every new analysis
– Two main options
– File systems (CSV or XML files)
– Database
Storing in CSV files (plain Python)
– Assumption:
– Data extracted, cleaned and collected in Python arrays or dictionaries
– Export to CSV via Python example:
import csv
...
with open("coviddata.csv", "w") as csvfile:
    writer = csv.writer(csvfile)  # use csv.DictWriter(…) if writing a dictionary variable
    # nswstats = [
    #   ["Adamant", "26.3.1821", "https://convictrecords.com.au/ships/adamant"],
    #   ["Albion", "29.5.1828", "https://convictrecords.com.au/ships/albion"],
    #   ...
    # ]
    for s in nswstats:
        writer.writerow(s)
Storing in CSV files using Pandas
import pandas as pd
# assuming df holds the scraped table (via requests + BeautifulSoup + pd.read_html, as on the previous slides)
df.to_csv('covid_stats_nsw_by_age_group.csv')
Storing extracted data in Databases
– If the data is structured and already prepared, this is pretty straightforward
(Assumption: Data extracted, cleaned and collected in Python arrays or dictionaries)
– Export to SQL database via Python, example:
import psycopg2
def pgconnect(): …
def pgquery(): …
# load data (assuming a list 'ships' of dictionaries with the given keys)
insert_stmt = "INSERT INTO Ships VALUES (%(name)s,%(last_voyage)s,%(url)s)"
for s in ships:
    pgquery(conn, insert_stmt, s)  # alternatively use pandas' df.to_sql()
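The same named-placeholder pattern can be tried offline with the stdlib sqlite3 module (sqlite3 uses :name placeholders where psycopg2 uses %(name)s; the ship rows are the examples from the CSV slide):

```python
import sqlite3

# In-memory database stands in for the PostgreSQL connection
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Ships (name TEXT, last_voyage TEXT, url TEXT)")

ships = [  # cleaned rows as dictionaries, as assumed on the slide
    {"name": "Adamant", "last_voyage": "26.3.1821",
     "url": "https://convictrecords.com.au/ships/adamant"},
    {"name": "Albion", "last_voyage": "29.5.1828",
     "url": "https://convictrecords.com.au/ships/albion"},
]
insert_stmt = "INSERT INTO Ships VALUES (:name, :last_voyage, :url)"
for s in ships:
    conn.execute(insert_stmt, s)  # one parameterised insert per dictionary
conn.commit()

print(conn.execute("SELECT count(*) FROM Ships").fetchone()[0])  # 2
```

Parameterised statements also protect against SQL injection, which matters when the inserted values come from scraped pages.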
Lessons Learned
– Web Scraping
– Steps: exploring, crawling, parsing, cleaning, storing & analysis
– Many tools and support libraries
– Scraping web pages with Python using requests, BeautifulSoup, lxml, …
– HTML and XML are Semi-structured Data Models
– Data models that can handle variants and optional attributes
– Self-describing; does not require schema first (still: valid vs. well-formed)
– Central model: tree
– Storing extracted Web Data
– Storage of scraped data and even XML in files or databases possible
– querying XML is difficult because of the nested, graph-like structure
Next Week
– SQL Test (Wed, 5th April, 12pm AEST, online via ED)
References
– "Data Science From Scratch", Chapter 23
– Python Libraries:
– PIP: sudo python -m ensurepip --default-pip
– Requests library (`pip install requests`) - http://docs.python-requests.org/en/master/
– BeautifulSoup4 (`pip install bs4`) - https://www.crummy.com/software/BeautifulSoup/bs4/doc/
– LXML (`pip install lxml`) - https://lxml.de/lxmlhtml.html
– Scrapy - https://docs.scrapy.org/en/latest/intro/overview.html
– Selenium - https://selenium-python.readthedocs.io
– Semistructured Data, XML: http://www.w3.org/TR/xml
– PostgreSQL Online documentation
– http://www.postgresql.org/docs/current/static/
– http://www.postgresql.org/docs/current/static/datatype-xml.html
– http://www.postgresql.org/docs/current/static/functions-aggregate.html
– General tips
– http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/
– https://ianlondon.github.io/blog/web-scraping-discovering-hidden-apis/
– https://github.com/stanfordjournalism/search-script-scrape
– https://blog.hartleybrody.com/web-scraping-cheat-sheet/
– https://bigishdata.com/2017/06/06/web-scraping-with-python-part-two-library-overview-of-requests-urllib2-beautifulsoup-lxml-scrapy-and-more/