Implementing Web Scraping in Python With BeautifulSoup
URL = "https://www.geeksforgeeks.org/data-structures/"
r = requests.get(URL)
print(r.content)
Step 3: Parsing the HTML content
import requests
from bs4 import BeautifulSoup

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib') # If this line causes an error, run 'pip install html5lib' or install html5lib
print(soup.prettify())
A really nice thing about the BeautifulSoup library is that it is built on top of
HTML parsing libraries like html5lib, lxml, html.parser, etc., so a BeautifulSoup
object can be created and the parser library specified at the same time.
In the example above,
soup = BeautifulSoup(r.content, 'html5lib')
We create a BeautifulSoup object by passing two arguments:
r.content : It is the raw HTML content.
html5lib : Specifying the HTML parser we want to use.
When soup.prettify() is printed, it gives a visual representation of the parse tree created
from the raw HTML content.
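The parser can be swapped without changing anything else. As a minimal sketch (assuming the same URL as above), the snippet below parses the same response with Python's built-in html.parser instead of html5lib:
import requests
from bs4 import BeautifulSoup

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

# Same content as before, but parsed with the standard-library parser;
# 'lxml' or 'html5lib' could be passed here instead (if installed).
soup_builtin = BeautifulSoup(r.content, 'html.parser')
print(soup_builtin.prettify())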
Step 4: Searching and navigating through the parse tree
Now, we would like to extract some useful data from the HTML content. The soup object
contains all the data in a nested structure from which it can be programmatically extracted. In
our example, we are scraping a webpage consisting of some quotes. So, we would like to
create a program to save those quotes (and all relevant information about them).
import requests
from bs4 import BeautifulSoup
import csv

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')
quotes = []  # a list to store the extracted quotes

table = soup.find('div', attrs = {'id':'all_quotes'})

for row in table.findAll('div', attrs = {'class':'quote'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quotes.append(quote)

filename = 'inspirational_quotes.csv'
# DictWriter leaves the 'lines' and 'author' columns blank, since they are not extracted above
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme','url','img','lines','author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
Before moving on, we recommend you go through the HTML content of the webpage,
which we printed using the soup.prettify() method, and try to find a pattern or a way to
navigate to the quotes.
Notice that all the quotes are inside a div container whose id is
‘all_quotes’. So, we find that div element (referred to as table in the code above)
using the find() method:
table = soup.find('div', attrs = {'id':'all_quotes'})
The first argument is the HTML tag you want to search for, and the second argument is
a dictionary specifying additional attributes associated with that tag. The find()
method returns the first matching element. You can print table.prettify() to get a
sense of what this piece of code does.
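As a small illustrative sketch (assuming the soup object created earlier), find() returns a single Tag object, or None when nothing matches, so it is worth checking the result before navigating further:
table = soup.find('div', attrs = {'id':'all_quotes'})

if table is None:
    # Nothing with id 'all_quotes' was found; the page layout may have changed.
    raise SystemExit("Could not locate the quotes container")

print(table.name)              # 'div'
print(table['id'])             # 'all_quotes'
print(table.prettify()[:500])  # first part of the subtree, just for inspection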
Now, within the table element, notice that each quote is inside a div
container whose class is quote. So, we iterate over every such div container.
Here, we use the findAll() method, which takes the same arguments as the find()
method but returns a list of all matching elements. Each quote is then
iterated over using a variable called row.
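To make the difference from find() concrete, here is a minimal sketch (assuming the table element and the quote class described above); findAll() returns a list-like ResultSet that can be counted and looped over:
rows = table.findAll('div', attrs = {'class':'quote'})
print(len(rows))             # number of quote containers found on the page

for row in rows:
    # Each row is a Tag; nested tags are reachable as attributes.
    print(row.h5.text)       # the quote's theme
    print(row.a['href'])     # the link associated with the quote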
Here is the HTML content of one sample row, for better understanding: