Web Scraping Tables using Pandas


Estimated Effort: 5 mins

The Pandas library in Python provides a function, read_html(), that extracts tabular information from a web page by parsing the HTML tables it contains.

Consider the following example:

Let us assume we want to extract the list of the largest banks in the world by market capitalization, from the following link:
URL = 'https://en.wikipedia.org/wiki/List_of_largest_banks'

We can use the pandas.read_html() function in Python to extract all the tables in the web page directly. The function returns a list of DataFrames, one for each table it finds.

A snapshot of the webpage is shown below.

We can see that the required table is the first one in the web page.

Note: This is a live web page and it may get updated over time. The image shown above was captured in November 2023. The process of data extraction remains the same.

We may execute the following lines of code to extract the required table from the web page.

import pandas as pd

URL = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
tables = pd.read_html(URL)  # list of DataFrames, one per table found on the page
df = tables[0]  # the required table is the first one on the page
print(df)

This will extract the required table as a dataframe df. The output of the print statement would look as shown below.
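Because the page is live, the position of the required table may change over time. As a minimal sketch, the match parameter of read_html() can select tables by the text they contain instead of by index. The HTML below is a hypothetical stand-in for a real page (the bank names and figures are placeholders, not real data), so the example runs without network access. It assumes an HTML parser that pandas supports, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a live page: two tables whose order could change.
SAMPLE_HTML = """
<table>
  <tr><th>Note</th></tr>
  <tr><td>This list may be out of date</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr>
  <tr><td>1</td><td>Bank A</td><td>400.5</td></tr>
  <tr><td>2</td><td>Bank B</td><td>250.0</td></tr>
</table>
"""

# match keeps only the tables containing text that matches the given pattern.
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Market cap")
df = tables[0]
print(df)
```

Selecting by content is usually more robust than a fixed index, although the pattern still needs to be checked if the page's wording changes.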

Although convenient, this method comes with its own set of limitations.
First, a web page may store content in HTML tables that does not appear as a table when the page is displayed, so read_html() can return more tables than are visible on the page.

For instance, consider the following URL showing the list of countries by GDP (nominal).

URL = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'

The images on this web page are also stored in tabular format, so they are counted among the tables that pandas extracts. A snapshot of the web page is shared below.
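When a page holds tables that are not obviously tables, it helps to list everything read_html() returned before choosing an index. The following is a minimal sketch on a hypothetical HTML snippet (the layout table and the country rows are placeholders, not real data). It assumes an HTML parser that pandas supports, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page: a layout table holding an image, then the data table.
SAMPLE_HTML = """
<table>
  <tr><td>image placeholder</td><td>caption placeholder</td></tr>
</table>
<table>
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>Country A</td><td>100</td></tr>
  <tr><td>Country B</td><td>50</td></tr>
</table>
"""

tables = pd.read_html(StringIO(SAMPLE_HTML))

# Print the index, shape and column labels of each table to find the right one.
shapes = [table.shape for table in tables]
for index, table in enumerate(tables):
    print(index, table.shape, list(table.columns))
```

In this sketch the data table is at index 1, because the layout table is counted first.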

Second, the contents of the tables in a web page may include elements such as hyperlink text and other markers, which the pandas method also scrapes directly. The data may therefore require further cleaning.
A closer look at table 3 in the image shown above indicates that it contains many hyperlink texts, which the pandas function will also treat as data.


We can extract the table using the code shown below.


import pandas as pd

URL = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
tables = pd.read_html(URL)  # list of DataFrames, one per table found on the page
df = tables[2]  # the required table is the third one, so it has index 2
print(df)

The output of the print statement is shown below.

Note that the hyperlink texts have also been retained in the code output.
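One possible way to clean such output is to strip bracketed markers with a regular expression. The sketch below is only an illustration: the DataFrame is built by hand with hypothetical values that imitate scraped cells, and the pattern assumes the unwanted text always appears in square brackets, which may not hold for every page.

```python
import pandas as pd

# Hypothetical rows imitating scraped cells that still carry bracketed markers.
df = pd.DataFrame({
    "Country": ["Country A[n 1]", "Country B", "Country C[3]"],
    "Year": ["[n 2]2023", "2023", "2022"],
})

# Remove markers such as [3] or [n 1] from every text column, then trim spaces.
marker = r"\[[^\]]*\]"
for column in df.columns:
    df[column] = df[column].str.replace(marker, "", regex=True).str.strip()

# With the markers gone, the year column can be converted to integers.
df["Year"] = df["Year"].astype(int)
print(df)
```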

It is also worth pointing out that this method works only for extracting tabular data. The BeautifulSoup library remains the default method for extracting any kind of information from web pages.
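As a brief illustration of non-tabular extraction, the sketch below uses BeautifulSoup to pull a heading and the hyperlinks out of a hypothetical HTML snippet. It assumes the beautifulsoup4 package is installed and uses Python's built-in html.parser. The snippet and its link targets are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with a heading and two links (no tables).
SAMPLE_HTML = """
<h1>Largest banks</h1>
<p>See the <a href="/wiki/Bank_A">Bank A</a> and
<a href="/wiki/Bank_B">Bank B</a> articles.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text of the first <h1> element.
heading = soup.find("h1").get_text(strip=True)

# (link text, target) pairs for every <a> element.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(heading)
print(links)
```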

Author(s)
Abhishek Gagneja
