ibm-python-module-5-web-scraping-pandas

The document explains how to use the Pandas library in Python to extract tables from web pages using the read_html() function. It provides examples of extracting tables from Wikipedia pages, highlighting potential limitations such as the presence of hyperlinks and the need for data cleaning. It also notes that while Pandas is useful for tabular data, BeautifulSoup is recommended for extracting other types of information from web pages.

Uploaded by

omegapesofficial

Web Scraping Tables using Pandas

Estimated Effort: 5 mins

The Pandas library in Python provides a function, read_html(), that can be used to extract tabular information from a web page. It returns a list of DataFrames, one for each HTML table it finds.

Consider the following example:

Let us assume we want to extract the list of the largest banks in the world by market capitalization, from the following link:
URL = 'https://en.wikipedia.org/wiki/List_of_largest_banks'

We can use the pandas.read_html() function to extract all the tables on the web page directly.

A snapshot of the webpage is shown below.

We can see that the required table is the first one in the web page.

Note: This is a live web page and it may be updated over time. The image shown above was captured in November 2023. The process of data extraction remains the same.

We may execute the following lines of code to extract the required table from the web page.
import pandas as pd

URL = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
tables = pd.read_html(URL)  # list of DataFrames, one per table found
df = tables[0]  # the required table is the first one on the page
print(df)

This will extract the required table as a DataFrame, df, which the print statement then displays.
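
Because the page is live, the position of a table in the returned list may change over time. The following is a minimal sketch of one way to select a table by its text instead of its index, using the match parameter of read_html(). The HTML string, the bank names and the figures are hypothetical stand-ins for a downloaded page, not data from the Wikipedia article.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a downloaded page: one unrelated table, one data table.
html = """
<table>
  <tr><th>Note</th></tr>
  <tr><td>Unrelated table</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
  <tr><td>1</td><td>Bank A</td><td>400.5</td></tr>
  <tr><td>2</td><td>Bank B</td><td>250.0</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string or regex.
tables = pd.read_html(StringIO(html), match="Market cap")
df = tables[0]
print(df)
```

With this approach, the code keeps working if another table is added above the required one, as long as the matched text stays on the page.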

Although convenient, this method has its own set of limitations.
Firstly, a web page may store content in table markup even when that content does not appear as a table on the rendered page.

For instance, consider the following URL showing the list of countries by GDP (nominal).

URL = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'

The images on this web page are also stored in table markup. A snapshot of the web page is shared below.
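
When a page mixes layout tables with data tables, it can help to inspect every table that read_html() returns before choosing an index. The sketch below assumes a small hypothetical HTML string in place of the live page. Picking the table with the most rows is only one possible heuristic, not a rule that holds for every page.

```python
from io import StringIO

import pandas as pd

# Hypothetical page: an image held in a layout table, followed by a data table.
html = """
<table>
  <tr><td><img src="map.png" alt="">World map</td><td>Legend</td></tr>
</table>
<table>
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>Country A</td><td>100</td></tr>
  <tr><td>Country B</td><td>200</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))

# Print the index, shape and column labels of each table to locate the data table.
for i, table in enumerate(tables):
    print(i, table.shape, list(table.columns))

# One simple heuristic: take the table with the most rows.
best = max(range(len(tables)), key=lambda i: tables[i].shape[0])
df = tables[best]
```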

Secondly, the contents of the tables on a web page may contain elements such as hyperlink text and other markers, which the pandas method scrapes along with the data. The data may therefore need further cleaning.
A closer look at table 3 in the image shown above indicates that it contains many pieces of hyperlink text, which the pandas function will also treat as information.
We can extract the table using the code shown below.
import pandas as pd

URL = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
tables = pd.read_html(URL)  # list of DataFrames, one per table found
df = tables[2]  # the required table will have index 2
print(df)

The output of the print statement is shown below.

Note that the hyperlink text has also been retained in the code output.
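
The cleaning step can be sketched as follows, assuming the leftover hyperlink text takes the form of bracketed markers such as [5] or [n 1]. The DataFrame below is hypothetical and is built by hand rather than scraped, so the column names and values are illustrative only.

```python
import pandas as pd

# Hypothetical rows resembling scraped output with bracketed markers left in.
df = pd.DataFrame({
    "Country": ["Country A[n 1]", "Country B", "Country C[5]"],
    "GDP": ["1,234[a]", "567", "89[b]"],
})

marker = r"\[[^\]]*\]"  # any text enclosed in square brackets

# Remove the markers from the text column and trim stray whitespace.
df["Country"] = df["Country"].str.replace(marker, "", regex=True).str.strip()

# Remove markers and thousands separators, then convert to integers.
df["GDP"] = (
    df["GDP"]
    .str.replace(marker, "", regex=True)
    .str.replace(",", "", regex=False)
    .astype(int)
)
print(df)
```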

It is also worth pointing out that this method works only for extracting tabular data. The BeautifulSoup library remains the default method for extracting any kind of information from web pages.
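
For comparison, here is a minimal sketch of extracting non-tabular information with BeautifulSoup. It assumes the bs4 package is installed and parses a small hypothetical HTML string rather than a live page.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment containing a heading and two hyperlinks.
html = """
<html><body>
  <h1>Example page</h1>
  <p>See the <a href="/wiki/Bank">bank</a> and <a href="/wiki/GDP">GDP</a> articles.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of the first <h1> element.
title = soup.find("h1").get_text(strip=True)

# (link text, target) pairs for every <a> element.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(title)
print(links)
```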

Author(s)
Abhishek Gagneja
