Implementing Web Scraping in Python With BeautifulSoup
URL = "https://www.geeksforgeeks.org/data-structures/"
r = requests.get(URL)
print(r.content)
Step 3: Parsing the HTML content
import requests
from bs4 import BeautifulSoup

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib') # If this line causes an error, run 'pip install html5lib' or install html5lib
print(soup.prettify())
A really nice thing about the BeautifulSoup library is that it is built on top of
HTML parsing libraries like html5lib, lxml, html.parser, etc., so a BeautifulSoup
object can be created and the parser library specified at the same time.
In the example above,
soup = BeautifulSoup(r.content, 'html5lib')
We create a BeautifulSoup object by passing two arguments:
r.content : It is the raw HTML content.
html5lib : Specifying the HTML parser we want to use.
When soup.prettify() is printed, it gives a visual representation of the parse tree created
from the raw HTML content.
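The parser can be swapped without changing anything else. As a minimal sketch (assuming the same URL as above), the snippet below parses the same response with Python's built-in html.parser instead of html5lib:
import requests
from bs4 import BeautifulSoup

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

# Same content as before, but parsed with the standard-library parser;
# 'lxml' or 'html5lib' could be passed here instead (if installed).
soup_builtin = BeautifulSoup(r.content, 'html.parser')
print(soup_builtin.prettify())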
Step 4: Searching and navigating through the parse tree
Now, we would like to extract some useful data from the HTML content. The soup object
contains all the data in a nested structure from which it can be programmatically extracted. In
our example, we are scraping a webpage consisting of some quotes. So, we would like to
create a program to save those quotes (and all relevant information about them).
import requests
from bs4 import BeautifulSoup
import csv

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')
quotes = []  # a list to store the extracted quotes

table = soup.find('div', attrs = {'id':'all_quotes'})

for row in table.findAll('div', attrs = {'class':'quote'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quotes.append(quote)

filename = 'inspirational_quotes.csv'
# DictWriter leaves the 'lines' and 'author' columns blank, since they are not extracted above
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme','url','img','lines','author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
Before moving on, we recommend you go through the HTML content of the webpage,
which we printed using the soup.prettify() method, and try to find a pattern or a way to
navigate to the quotes.
Notice that all the quotes are inside a div container whose id is
‘all_quotes’. So, we find that div element (referred to as table in the code above)
using the find() method:
table = soup.find('div', attrs = {'id':'all_quotes'})
The first argument is the HTML tag you want to search for, and the second argument is
a dictionary specifying additional attributes associated with that tag. The find()
method returns the first matching element. You can print table.prettify() to get a
sense of what this piece of code does.
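As a small illustrative sketch (assuming the soup object created earlier), find() returns a single Tag object, or None when nothing matches, so it is worth checking the result before navigating further:
table = soup.find('div', attrs = {'id':'all_quotes'})

if table is None:
    # Nothing with id 'all_quotes' was found; the page layout may have changed.
    raise SystemExit("Could not locate the quotes container")

print(table.name)              # 'div'
print(table['id'])             # 'all_quotes'
print(table.prettify()[:500])  # first part of the subtree, just for inspection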
Now, within the table element, notice that each quote is inside a div
container whose class is quote. So, we iterate over every such div container.
Here, we use the findAll() method, which takes the same arguments as the find()
method but returns a list of all matching elements. Each quote is then
iterated over using a variable called row.
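To make the difference from find() concrete, here is a minimal sketch (assuming the table element and the quote class described above); findAll() returns a list-like ResultSet that can be counted and looped over:
rows = table.findAll('div', attrs = {'class':'quote'})
print(len(rows))             # number of quote containers found on the page

for row in rows:
    # Each row is a Tag; nested tags are reachable as attributes.
    print(row.h5.text)       # the quote's theme
    print(row.a['href'])     # the link associated with the quote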
Here is the HTML content of one sample row, for better understanding: