20

The document explains how to use the Pandas library in Python to scrape tables from web pages using the read_html() function. It provides examples of extracting tables from Wikipedia pages and highlights limitations such as the presence of hyperlinks in the scraped data. The document also notes that while Pandas is useful for tabular data, BeautifulSoup is recommended for more general web scraping tasks.

Uploaded by

Edwin Pimiento


15/5/25, 11:45 a.m.


Web Scraping Tables using Pandas

Estimated Effort: 5 mins

The Pandas library in Python provides a function, read_html(), that extracts tabular data from a web page. It returns every HTML table it finds on the page as a list of DataFrames.

Consider the following example:

Let us assume we want to extract the list of the largest banks in the world by market capitalization, from the following link:
URL = 'https://en.wikipedia.org/wiki/List_of_largest_banks'

We can use the pandas.read_html() function to extract all the tables on the web page directly.

A snapshot of the webpage is shown below.

We can see that the required table is the first one in the web page.

Note: This is a live web page and it may be updated over time. The image shown above was captured in November 2023. The process of data extraction remains the same.

We may execute the following lines of code to extract the required table from the web page.

import pandas as pd

URL = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
# read_html() returns a list of DataFrames, one per table found on the page
tables = pd.read_html(URL)
df = tables[0]  # the required table is the first one on the page
print(df)

This will extract the required table as a dataframe df. The output of the print statement would look as shown below.

Although convenient, this method has its limitations.
First, a web page may store content in HTML tables that does not look like a table when the page is displayed. read_html() returns those tables too, so the table you want is not always at the index you expect.

For instance, consider the following URL showing the list of countries by GDP (nominal).

URL = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'

The images on the web page are also saved in tabular format. A snapshot of the web page is shared below.
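
As a rough, offline illustration of this first limitation, the sketch below parses a small made-up HTML fragment instead of the live page. The fragment, the country names, and the figures are hypothetical. It assumes an HTML parser that read_html() can use (lxml, or BeautifulSoup with html5lib) is installed. It lists every table found, then uses the match parameter of read_html() to keep only tables whose text contains "GDP", which is one possible alternative to relying on a fixed index.

```python
from io import StringIO

import pandas as pd

# Hypothetical page fragment: a layout table that only wraps an image
# caption, followed by the data table we actually want.
HTML = """
<table>
  <tr><td>Map of countries (image placeholder)</td></tr>
</table>
<table>
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>Aland</td><td>100</td></tr>
  <tr><td>Borduria</td><td>50</td></tr>
</table>
"""

# Every <table> element is returned, including the layout table.
all_tables = pd.read_html(StringIO(HTML))
for index, table in enumerate(all_tables):
    print(index, table.shape, list(table.columns))

# match keeps only the tables whose text matches the given pattern.
gdp_tables = pd.read_html(StringIO(HTML), match="GDP")
df = gdp_tables[0]
print(df)
```

Printing the shape and column names of each table is a quick way to find the right index on a real page before hard-coding it.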

Second, the cells of a table may contain hyperlink text and other markers, which the pandas method scrapes along with the data. The extracted data may therefore need further cleaning.
A closer look at table 3 in the image shown above shows many pieces of hyperlink text, all of which the pandas function will treat as data.


We can extract the table using the code shown below.

import pandas as pd

URL = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
tables = pd.read_html(URL)
df = tables[2]  # the required table has index 2 (the third table on the page)
print(df)

The output of the print statement is shown below.

Note that the hyperlink texts have also been retained in the code output.
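
One possible way to do the extra cleaning is sketched below. It assumes the leftover hyperlink text takes the form of bracketed markers such as [1] or [n 1]. The DataFrame, its column names, and its values are made up to imitate scraped output and are not taken from the real page.

```python
import pandas as pd

# Hypothetical rows imitating scraped output; all values are made up.
raw = pd.DataFrame({
    "Country/Territory": ["Aland[n 1]", "Borduria", "Syldavia[a]"],
    "Forecast": ["1,200[2]", "950", "n/a"],
})

# Assumed marker format: any text enclosed in square brackets.
MARKER = r"\[[^\]]*\]"

clean = raw.copy()

# Remove the markers from the text column and trim leftover whitespace.
clean["Country/Territory"] = (
    clean["Country/Territory"].str.replace(MARKER, "", regex=True).str.strip()
)

# Remove markers and thousands separators, then convert to numbers.
# Entries that are still not numeric become NaN.
clean["Forecast"] = pd.to_numeric(
    clean["Forecast"]
    .str.replace(MARKER, "", regex=True)
    .str.replace(",", "", regex=False),
    errors="coerce",
)

print(clean)
```

The pattern is deliberately broad, so it would also delete legitimate bracketed text. On real data it is worth checking a few rows before and after cleaning.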

Note also that this method works only for extracting tabular data. The BeautifulSoup library remains the default choice for extracting any other kind of information from web pages.
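
For contrast, here is a minimal sketch of pulling non-tabular content with BeautifulSoup. It assumes the beautifulsoup4 package is installed and uses Python's built-in html.parser. The HTML fragment is hypothetical and is parsed from a string so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment containing no tables at all.
HTML = """
<html><body>
  <h1>List of largest banks</h1>
  <p>See also <a href="/wiki/Bank">Bank</a> and
     <a href="/wiki/Market_capitalization">Market capitalization</a>.</p>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Text of the first <h1> element.
title = soup.find("h1").get_text(strip=True)

# Link text and target of every <a> element.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(title)
print(links)
```

read_html() would find nothing to return for this fragment, since it contains no table elements.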

Author(s)
Abhishek Gagneja

