
DATA2001 – Data Science, Big Data and Data Diversity

Week 6: Scraping Web Data

Dr Ali Anaissi
School of Computer Science

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 1
Learning Objectives This Week
– Web Scraping basics
– Python Libraries for Web Scraping
– Web Standards / Intro Semi-structured Data
– HTML
– CSS selectors
– Data Cleaning
– Storing and querying semi-structured data in PostgreSQL

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 2
Web Scraping

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 3
Getting Data
– Existing files: Excel Sheets, CSV, …
– Databases
– Querying existing databases with SQL
– Scraping the Web
– Web crawling + HTML parsing
– Programming APIs
– 'querying' web service APIs
– more details on following slides

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 4
Motivating Example: How can we extract this data?

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 5
Scraping the Web
– Web pages are written in HTML which is a semi-structured
data format with some similarity to XML
<html>
<head>
<title>Data Science, Big Data and Data Diversity</title>
</head>
<body>
<h1><span class="uoscode">DATA2001</span> - Data Science and Big Data</h1>
<div class="lecturer">Uwe Roehm</div>
<p id="4711" class="description">DATA2001 is about …</p>
</body>
</html>

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 6
Overview Web Requests

[Diagram: a web browser sends an HTTP request over the network to a web server and receives HTML back. The web server either serves static web pages read from files, or dynamically generates pages (via, e.g., PHP or Python), possibly querying a database system through DB-API, JDBC, or ODBC. The database server holds the persistent web state.]

– Browsing the Web:
– Client program or just a web browser sends an HTTP request to the web server
– The web server / application answers with either static content
or dynamically constructs content based on the request
– Database server: persistent web state
– Response from the web server: HTML
Web Scraping – General Approach
– Reconnaissance
– Identify source, and check its structure and content
– Webpage Retrieval
– Download one or multiple pages from source
– Typically in a script or program that auto-generates new URLs based on
website structure and its URL format
– Data Extraction from webpage
– Content parsing, raw data extraction
– Data Cleaning and transformation into required format
– Data Storage / Analysis / combining with other data sets
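
Taken together, these steps form a small pipeline. A minimal sketch, assuming a hypothetical page that lists data in an HTML table (the URL and the table layout are illustrative assumptions, not a real endpoint):

import requests
from bs4 import BeautifulSoup

# 1. Retrieval: download one page (the URL is an illustrative placeholder)
html = requests.get("http://www.example.com/ships").text

# 2. Extraction: parse the HTML and pull the raw values out of the table rows
soup = BeautifulSoup(html, "html5lib")
rows = soup.find_all("tr")

# 3. Cleaning: strip excess whitespace from every table cell
records = [[cell.get_text().strip() for cell in row.find_all("td")]
           for row in rows]

# 4. Storage / analysis: print here; could equally write a CSV file or a DB table
for record in records:
    print(record)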
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 8
Scraping the Web (cont’d)
– HTML is not always well-formed, let alone annotated or
semantically marked up
– Many HTML parsers are too strict for real-world usage, including
Python’s built-in parser
– Would stop parsing incorrectly written web pages without giving us a chance
to extract data
– Luckily, there are several 3rd party support libraries or tools to
help us

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 9
Scraping the Web (cont'd)
– There are several support libraries for Python to scrape the web
– HTML crawling: Requests library (import requests)
http://docs.python-requests.org/en/master/
– HTML parsing: html5lib
– Data extraction: BeautifulSoup library (import bs4)
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
– Website crawling: Scrapy framework
– Pandas …

– Example Code:
import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.example.com").text
soup = BeautifulSoup(html, 'html5lib')

first_paragraph = soup.find('p')  # or just soup.p
first_paragraph_text = soup.p.text
first_paragraph_id = soup.p.get('id')
lecturers = soup('div', 'lecturer')
print(len(lecturers))
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 10
Which Tools?
– Lots of tools and programming frameworks available:
– Unix command line tools
– curl, grep, awk, perl, xpath, xmllint, xidel, …
– 3rd party tools
– e.g. Google Spreadsheets (ImportHTML() function)
– Many commercial solutions with nice 'click' interfaces and visualisations
– Can be expensive… Examples: Import.io, Dexi.io, and many more
– WebCrawlers-as-a-Service (e.g. Scrapinghub)
– Programming libraries
– E.g. Pandas or the BeautifulSoup library for Python; or frameworks like Scrapy
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 11
Example: NSW COVID-19 Data
– https://www.health.nsw.gov.au/news/Pages/2022-nsw-health.aspx

– Check logic and structure of the website
– Inspect the webpage structure; in this example, looking for a specific ship
• Reasonable HTML, including some annotation and classes to identify the data
part easily
• Note any URL patterns
– Let's try getting it – Unix? => DATA2901
• curl – transfer a URL to the local machine
• xidel – parse HTML and extract sub-parts
• any text editor of choice
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 12
For single webpages, Google Spreadsheets can help
– ImportHTML
– URL
– “list” or “table”
– Index of which list or table to import from webpage

– Example (in a Google Spreadsheet):
ImportHTML("https://www.health.nsw.gov.au/news/Pages/20220330_00.aspx",
"table", 1)

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 13
Some Tips from our TA Harshana and former Tutor Chris
– URL format is incredibly important to take note of
– Look at any parameters (e.g. flight data from the Airport OnTime dataset:
http://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2021_1.zip
– the year and month are parameterised and can be looped over, as the sketch below shows)
– Find any patterns with links when accessing data
(i.e. do they publish it monthly, yearly, bi-weekly etc.)
– Access tokens (i.e. do they require an API key?)
– Web page structure is useful to note.
– Use the page inspector to narrow down what you're looking for.
– Complex tokenising can get messy (we might have to tokenise child nodes of the
elements)
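
A small sketch of looping over such a parameterised URL. The year/month ranges and the exact file-name pattern follow the BTS example above and are assumptions to verify against the actual site:

import requests

# The BTS on-time archives are parameterised by year and month; the file-name
# pattern is taken from the URL above (an assumption – verify on the site).
base = ("http://transtats.bts.gov/PREZIP/"
        "On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{y}_{m}.zip")

for year in range(2020, 2022):        # 2020 and 2021
    for month in range(1, 13):        # months 1..12
        url = base.format(y=year, m=month)
        response = requests.get(url)
        if response.status_code == 200:
            # save each archive locally so it need not be re-downloaded
            with open(f"ontime_{year}_{month}.zip", "wb") as f:
                f.write(response.content)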

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 14
Robots Exclusion Standard
– Many websites provide a robots.txt file
– Meant for web crawlers, which should check its content before starting to
crawl a website
– Different rules in here. Example: https://en.wikipedia.org/robots.txt
• Crawling/scraping allowed at all?
• Only specific subdirectories?
• Only certain programs ("user-agent")?
• Which frequency ("request-rate")?
– Cf. https://en.wikipedia.org/wiki/Robots_exclusion_standard
– Be a good net citizen:
Check, ask, don't overload – and don't steal (check copyright!)
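
Python's standard library can check these rules programmatically. A minimal sketch using urllib.robotparser (the user-agent string "MyScraperBot" is an arbitrary example):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

# ask whether our (hypothetical) user agent may fetch a given page
print(rp.can_fetch("MyScraperBot", "https://en.wikipedia.org/wiki/Web_scraping"))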
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 15
Is it Legal?
– Web scraping in itself is not illegal; you are free to save all
publicly available data on the internet to your computer.
– The way you use that data is what might be illegal.
– Please read the website's terms and conditions and its robots.txt,
and make sure you are not doing anything illegal :)

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 16
Data Cleaning
– This is a topic on its own…
– Data from websites is hardly ever in a clean format
– neither in its format
– nor in its content

– Rules of thumb:
– Be prepared that things are different than they are supposed to be (" , ; \t)
– Clean data before further processing or storage
• E.g. empty cells; placeholders; special characters or excess spaces
• Do it programmatically so that you can re-use your solution
– Cross-check data consistency once loaded; e.g. spelling variants of the same entity?
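
A small sketch of what such programmatic cleaning might look like; the placeholder values and the sample row are illustrative assumptions, not from a real dataset:

def clean_cell(value):
    """Normalise one scraped table cell (illustrative rules only)."""
    text = value.strip()                    # drop excess whitespace
    text = text.replace("\xa0", " ")        # non-breaking spaces from HTML
    if text in ("", "-", "n/a", "N/A"):     # common empty-cell placeholders
        return None
    return text

# example usage on one scraped row
row = [" Albion ", "n/a", "1823\xa0"]
print([clean_cell(c) for c in row])         # ['Albion', None, '1823']

Because the rules live in one function, the same cleaning can be re-used across pages and re-run whenever the scraper is updated.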
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 17
Extracting Data from HTML

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 18
Web Retrieval in Python (Single Pages)
– Two forms of requests: GET (optionally with parameters) or POST (with params)
– Making simple web requests in Python
– Python requests library:
standard webpage: requests.get(URL)
webpage with parameters: requests.get(URL, params=dict(key=value, …))
web form (POST request): requests.post(URL, data=dict(key=value, …))

– Example:
import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.example.com")
print(response.status_code)  # inspect response code of the server
content = BeautifulSoup(response.text, 'html5lib')

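As a concrete sketch of the parameterised forms, using httpbin.org (a public request-echo service) purely for illustration:

import requests

# GET with query parameters: ?q=convict+ships&page=1 is appended to the URL
r = requests.get("https://httpbin.org/get",
                 params=dict(q="convict ships", page=1))
print(r.url)            # shows the full URL that was actually requested

# POST with form data: the key/value pairs travel in the request body
r = requests.post("https://httpbin.org/post",
                  data=dict(username="demo", year=2022))
print(r.status_code)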
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 19
Web Page Retrieval: URLs
– URL – Uniform Resource Locator
– “address” format on the web
– Example:
• https://www.health.nsw.gov.au/news/Pages/20220329_00.aspx
– General Format
• protocol://site/path_to_resource
• Typical protocols: http https ftp

– Can be scripted or programmed; more details later and in tutorials

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 20
Web Site Crawling in Python (multiple page scraping)
– Scrapy
– Extensive Python framework to implement a web 'spider' – a program that follows multiple links
along the web (see the minimal sketch after this list)
– Can extend the spider class with your own functionality, which extracts parts of the visited pages
while the spider follows further links
– https://docs.scrapy.org/en/latest/intro/overview.html

– Selenium
– A programmable web browser with a Python binding, which allows you to actually send
requests as if a user had clicked on links or used a page (including running embedded
JavaScript)
– Typically used for automated testing of websites
– But can also be used for 'crawling' a complex interactive website
– https://selenium-python.readthedocs.io
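
A minimal spider sketch in the spirit of Scrapy's overview documentation; the start URL and both CSS selectors are placeholders for whatever site is actually being crawled:

import scrapy

class ShipsSpider(scrapy.Spider):
    name = "ships"
    start_urls = ["http://www.example.com/ships"]    # placeholder URL

    def parse(self, response):
        # extract one item per table row (both selectors are assumptions)
        for row in response.css("table.data tr"):
            yield {"name": row.css("td::text").get()}

        # follow pagination links so the spider keeps crawling further pages
        for href in response.css("a.next::attr(href)"):
            yield response.follow(href, self.parse)

Run with 'scrapy runspider ships_spider.py'; Scrapy then takes care of request scheduling and duplicate filtering, and can be configured to throttle the request rate.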

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 21
HTML – Hypertext Markup Language
– Webpages are written in HTML
– Textual markup language that defines structure, content, and design
of a page as well as active elements (scripts, forms, etc.)
– Typically several additional files linked:
• CSS - cascading style sheets
• Scripts, Images, videos etc.
– Markup via open & closing tags (e.g. <title>…</title>)
– Pre-defined in HTML standards (http://www.w3.org)
– Interpreted by web browsers for display
– HTML is designed to be interpreted by programs
– How can we extract data with our own programs?
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 22
HTML Example
<!DOCTYPE html>
<html>
<head>
<title>Literature List…</title>
</head>
<body>
<h1>References</h1>
<p>The following are some interesting links on web scraping:</p>
<div id="biblist">
<ul>
<li> "Data Science From Scratch", Chapter 23 </li>
<li> <a href="http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/">Web Scraping for Data Journalists</a> </li>
</ul>
</div> …
</body>
</html>
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 23
General Structure of a Web Page
– Head
– title, style sheets, scripts, meta-data
– Body
– headings, text, lists, tables, images, forms etc.

– Wide variety in the quality of web pages
– Some pages are automatically generated from a CMS => not really human-readable
– Some are heavy on design elements, others are more "structured"
– Web page inspector of the web browser
– Good for the reconnaissance phase
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 24
How to Select Content in a Webpage?
– Four options:
– text patterns
• simple, but not really great for complex patterns, as we rely on our own
parsing…
– DOM navigation
• Document Object Model
– CSS selectors
• based on the tag types, class specifications and IDs of elements
• easy to specify, but depends on CSS classes and IDs being well used
– XPath expressions => Week 7
• powerful language that allows us to navigate along the document tree
and select all nodes or even sub-trees that match the path expression
• can contain filter predicates, e.g. on values of XML/HTML attributes
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 25
Content Extraction with BeautifulSoup
Example of DOM-based navigation and data extraction:

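A sketch of such DOM-based navigation, re-using the small course-page HTML shown earlier:

from bs4 import BeautifulSoup

html = """
<html><head><title>Data Science, Big Data and Data Diversity</title></head>
<body>
  <h1><span class="uoscode">DATA2001</span> - Data Science and Big Data</h1>
  <div class="lecturer">Uwe Roehm</div>
  <p id="4711" class="description">DATA2001 is about …</p>
</body></html>"""

soup = BeautifulSoup(html, "html5lib")

# dot notation walks the DOM tree from parent element to child element
print(soup.html.head.title.text)    # 'Data Science, Big Data and Data Diversity'
print(soup.body.h1.span.text)       # 'DATA2001'

# attributes of an element are accessed like dictionary entries
print(soup.p["id"])                 # '4711'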
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 26
Content Extraction into a Pandas DataFrame
– Example:
import pandas as pd
import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.example.com")
print(response.status_code)  # inspect response code of the server
content = BeautifulSoup(response.text, 'html5lib')

table = content.find_all('table')[0]
df = pd.read_html(str(table))[0]  # only works with HTML tables
countries = df["COUNTRY"].tolist()
print(countries)

# pretty print
from tabulate import tabulate
print(tabulate(df, headers="keys", tablefmt="psql"))
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 27
HTML Document Model (DOM): Element-Tree

[Diagram: the DOM element tree of the convict-ship example page. The root (document element) is html, with the children head and body. head contains title ("Albion Convict Ship 1823"), meta (with a content="text/html" attribute) and script (JavaScript). body contains h1 ("Albion Voyage 1823"), p ("Sailed to Van Diemen's Land in 1823.") and div (with an id="results" attribute), which in turn contains a table (class="data") of tr rows holding th/td cells such as "Convict" and "William".]

– "title" is contained in "head": a path along the tree
– BeautifulSoup's dot notation (cf. previous slide) allows us to follow a path along this DOM tree
– Sibling tags have an order from left to right!
CSS Selectors
– HTML elements can have multiple CSS class attributes associated for display,
as well as an ID attribute for identification
– Example: <table class="data" id="42">
– CSS selectors (the most important ones):
– Selecting an element e with a specific class: e.class
• E.g. <table class="data"> => table.data
– Selecting an element e by ID: e#id
• E.g. <div id="results"> => div#results or just #results
– Selecting by position within a parent element
• e:first-child e:last-child e:nth-child(n)
– Selecting instances out of multiple occurrences
• e:nth-of-type(n)
– …
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 29
Using CSS Selectors with BeautifulSoup
– BeautifulSoup provides several functions that support CSS selectors:
find() find_all() select()
– Examples: (assuming page_content is a parsed webpage)
– Finding elements by type:
elements = page_content.find_all("h3")
for e in elements: …
– Finding an element by id:
element = page_content.find(id="ship")
– Finding table elements with a specific CSS class:
elements = page_content.find_all("table", "data")
– Looking for tags matching a general (complex) CSS selector:
elements = page_content.select("#ship .data")
for e in elements: …
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 30
Scraping HTML with Pandas
import pandas as pd

# pandas can also read directly from a URL – but only tables!
dfs = pd.read_html('https://www.health.nsw.gov.au/infectious/diseases/Pages/covid-19-latest.aspx')
# scrapes tables as a list(!) of DataFrames
dfs[2].tail()

# plot as bar chart
%matplotlib inline
import matplotlib.pyplot as plt
f = plt.figure()
plt.title("Covid case sources in NSW")
dfs[2].plot.bar(x="Source", y="Cases", ax=f.gca())
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 31
Storing Scraped Web Data

DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 32
Storing Extracted Data?
– Crawling web pages takes some time, hence it is a good idea to
store the data locally once extracted
– Avoids re-crawling the remote servers every time for a new analysis
– Two main options
– File systems (CSV or XML files)
– Databases
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 33
Storing in CSV files (plain Python)
– Assumption:
– Data extracted, cleaned and collected in Python lists or dictionaries
– Export to CSV via Python, example:
import csv
...
with open("ships.csv", "w") as csvfile:
    writer = csv.writer(csvfile)  # use csv.DictWriter(…) if writing a dictionary variable
    # ships = [
    #     ["Adamant", "26.3.1821", "https://convictrecords.com.au/ships/adamant"],
    #     ["Albion", "29.5.1828", "https://convictrecords.com.au/ships/albion"],
    #     ...
    # ]
    for s in ships:
        writer.writerow(s)
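
The csv.DictWriter variant mentioned in the comment would look roughly like this sketch (the field names are illustrative):

import csv

ships = [
    {"name": "Adamant", "last_voyage": "26.3.1821",
     "url": "https://convictrecords.com.au/ships/adamant"},
]

with open("ships.csv", "w", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=["name", "last_voyage", "url"])
    writer.writeheader()        # writes the column names as the first row
    writer.writerows(ships)     # one row per dictionary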
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 34
Storing in CSV files using Pandas
import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.health.nsw.gov.au/infectious/diseases/Pages/covid-19-latest.aspx")
content = BeautifulSoup(page.text, 'html5lib')

# data = content.find_all('#in-australia')[0]  # page structure has changed

# here we apply a double filter – find all table elements (tags + sub-tags),
# then only retrieve the tables with a class="moh-rteTable-6" attribute
data = content.find_all("table", {"class": "moh-rteTable-6"})
df = pd.read_html(str(data))[0]
df.tail()

df.to_csv('covid_stats_nsw_by_age_group.csv')
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 35
Storing extracted data in Databases
– If the data is structured and already prepared, this is pretty straightforward
(assumption: data extracted, cleaned and collected in Python lists or dictionaries)
– Export to a SQL database via Python, example:
import psycopg2
def pgconnect(): …
def pgquery(): …

# 1st: login to the database
conn = pgconnect()

# 2nd: ensure that the schema is in place
pgquery(conn, "CREATE TABLE IF NOT EXISTS Ships ( name TEXT, … )", None)

# 3rd: load data (assuming dictionary variable 'ships' with given keys)
insert_stmt = "INSERT INTO Ships VALUES (%(name)s, %(last_voyage)s, %(url)s)"
for s in ships:
    pgquery(conn, insert_stmt, s)  # alternatively use pandas' df.to_sql()
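
The pandas alternative mentioned in the last comment might look like the following sketch, using SQLAlchemy for the connection (the connection string is a placeholder to adapt):

import pandas as pd
from sqlalchemy import create_engine

# placeholder credentials – replace with the actual PostgreSQL connection details
engine = create_engine("postgresql://user:password@localhost:5432/data2001")

df = pd.DataFrame({
    "name": ["Adamant", "Albion"],
    "last_voyage": ["26.3.1821", "29.5.1828"],
})

# creates (or appends to) the table and inserts all rows in one call
df.to_sql("ships", engine, if_exists="append", index=False)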
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 36
Lessons Learned
– Web Scraping
– Steps: exploring, crawling, parsing, cleaning, storing & analysis
– Many tools and support libraries
– Scraping web pages with Python using requests, BeautifulSoup, lxml, …
– HTML and XML are Semi-structured Data Models
– Data models that can handle variants and optional attributes
– Self-describing; does not require a schema first (still: valid vs. well-formed)
– Central model: tree
– Storing extracted Web Data
– Storage of scraped data and even XML in files or databases is possible
– Querying XML is difficult because of the nested, graph-like structure
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 37
Next Week
– SQL Test (Wed, 5th April, 12pm AEST, online via ED)

– Retrieving data from Web Services
– Introduction to semi-structured Data
– XML and JSON
– NoSQL databases
DATA2001 "Data Science, Big Data and Data Diversity" - 2022 (Roehm) 38
References
– "Data Science From Scratch", Chapter 23
– Python Libraries:
– PIP: sudo python -m ensurepip --default-pip
– Requests library ('pip install requests') - http://docs.python-requests.org/en/master/
– BeautifulSoup4 ('pip install bs4') - https://www.crummy.com/software/BeautifulSoup/bs4/doc/
– LXML ('pip install lxml') - https://lxml.de/lxmlhtml.html
– Scrapy - https://docs.scrapy.org/en/latest/intro/overview.html
– Selenium - https://selenium-python.readthedocs.io
– Semi-structured Data, XML: http://www.w3.org/TR/xml
– PostgreSQL Online Documentation
– http://www.postgresql.org/docs/current/static/
– http://www.postgresql.org/docs/current/static/datatype-xml.html
– http://www.postgresql.org/docs/current/static/functions-aggregate.html
– General tips
– http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/
– https://ianlondon.github.io/blog/web-scraping-discovering-hidden-apis/
– https://github.com/stanfordjournalism/search-script-scrape
– https://blog.hartleybrody.com/web-scraping-cheat-sheet/
– https://bigishdata.com/2017/06/06/web-scraping-with-python-part-two-library-overview-of-requests-urllib2-beautifulsoup-lxml-scrapy-and-more/
