Scraping Instagram with Python (using Selenium and Beautiful Soup)
This article shows how to scrape Instagram to download images and get information on posts from a public profile page or a hashtag page. The code uses both Selenium and Beautiful Soup to scrape Instagram images without the hassle of providing account details or authentication tokens.

1. Import dependencies

pip install selenium, and download ChromeDriver from the following link: http://chromedriver.chromium.org/

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time
import re
from urllib.request import urlopen
import json
from pandas.io.json import json_normalize
import pandas as pd
import numpy as np
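Note: in recent pandas versions json_normalize was moved to the top level, and the pandas.io.json import above is deprecated. A minimal sketch, assuming pandas 1.0 or later and a made-up sample dict:

import pandas as pd

# pandas >= 1.0 exposes json_normalize directly on the pandas namespace;
# in step 4 below, pd.json_normalize(posts) would replace json_normalize(posts)
sample = {'shortcode_media': {'shortcode': 'Bxyz123'}}
x = pd.json_normalize(sample)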

2. Open the web browser: Selenium uses ChromeDriver to open the profile for a given username (public user). For example -

username = 'pickuplimes'
browser = webdriver.Chrome('/path/to/chromedriver')
browser.get('https://www.instagram.com/' + username + '/?hl=en')
pagelength = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
If you want to open a hashtag page -

hashtag = 'food'
browser = webdriver.Chrome('/path/to/chromedriver')
browser.get('https://www.instagram.com/explore/tags/' + hashtag)
pagelength = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
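Note: webdriver.Chrome('/path/to/chromedriver') is the Selenium 3 style. A minimal sketch for Selenium 4, assuming the same locally downloaded chromedriver (the path is a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 wraps the driver path in a Service object
browser = webdriver.Chrome(service=Service('/path/to/chromedriver'))
browser.get('https://www.instagram.com/' + username + '/?hl=en')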

3. Parse HTML source page: Open the source page and use Beautiful Soup to parse it. Go through the body of the HTML, extract the link for each image on the page, and append it to an empty list 'links'.

links = []
source = browser.page_source
data = bs(source, 'html.parser')
body = data.find('body')
script = body.find('span')
for link in script.findAll('a'):
    if re.match("/p", link.get('href')):
        links.append('https://www.instagram.com' + link.get('href'))
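If the span wrapper isn't present in the markup Instagram serves you, the same extraction can be run over every anchor on the page instead; a minimal sketch using the Beautiful Soup object from above:

links = []
source = browser.page_source
data = bs(source, 'html.parser')
# scan all anchors rather than relying on a specific <span> wrapper
for link in data.find_all('a', href=True):
    if re.match("/p", link['href']):
        links.append('https://www.instagram.com' + link['href'])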

Remember that by default Selenium only loads the first page. If you want to scroll through further pages and get more images, divide the scroll height by a number and run the parsing code multiple times. This adds the new links from each page to the list. For example -

pagelength = browser.execute_script("window.scrollTo(0, document.body.scrollHeight/1.5);")
links = []
source = browser.page_source
data = bs(source, 'html.parser')
body = data.find('body')
script = body.find('span')
for link in script.findAll('a'):
    if re.match("/p", link.get('href')):
        links.append('https://www.instagram.com' + link.get('href'))

# The sleep is required: without it Instagram may interrupt the script and the page won't scroll further.
time.sleep(5)
pagelength = browser.execute_script("window.scrollTo(document.body.scrollHeight/1.5, document.body.scrollHeight/3.0);")
source = browser.page_source
data = bs(source, 'html.parser')
body = data.find('body')
script = body.find('span')
for link in script.findAll('a'):
    if re.match("/p", link.get('href')):
        links.append('https://www.instagram.com' + link.get('href'))

This may not be the most efficient way to scroll through pages. I haven't tried other methods, but you could instead check end_cursor and has_next_page = True or False in the JSON data and loop through it. As a simpler variant, the scroll-and-parse step above can also be wrapped in a plain loop, as sketched below.
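A minimal sketch of that loop variant, reusing the parsing code from above; the iteration count and sleep time are arbitrary assumptions, and a set is used to avoid duplicate links:

links = set()
for _ in range(5):  # scroll a few times; adjust as needed
    source = browser.page_source
    data = bs(source, 'html.parser')
    # scan every anchor on the currently loaded page
    for link in data.find_all('a', href=True):
        if re.match("/p", link['href']):
            links.add('https://www.instagram.com' + link['href'])
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # give Instagram time to load the next batch of posts
links = list(links)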

4. Get information for each image on the page: To get more details of each image, such as who posted it, the post type, the image URL, the image caption, and the number of likes and comments, open the source page of each image (from the 'links' list in the previous code) and extract the JSON script into a pandas dataframe.

result = pd.DataFrame()
for i in range(len(links)):
    try:
        page = urlopen(links[i]).read()
        data = bs(page, 'html.parser')
        body = data.find('body')
        script = body.find('script')
        raw = script.text.strip().replace('window._sharedData =', '').replace(';', '')
        json_data = json.loads(raw)
        posts = json_data['entry_data']['PostPage'][0]['graphql']
        posts = json.dumps(posts)
        posts = json.loads(posts)
        x = pd.DataFrame.from_dict(json_normalize(posts), orient='columns')
        x.columns = x.columns.str.replace("shortcode_media.", "")
        result = result.append(x)
    except:
        pass  # skip posts that fail to load or parse

Then just check for duplicates -

result = result.drop_duplicates(subset='shortcode')
result.index = range(len(result.index))

The columns you get might be slightly different for a user profile page and a hashtag page. Check out the columns and filter whatever you need, for example as below.
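A minimal sketch of that filtering step; the two column names kept here are simply the ones used later in this article (other columns differ between profile and hashtag pages):

# keep only the columns needed for the download step
result = result[['shortcode', 'display_url']]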

5. Download images from the pandas dataframe: Use the requests library to download images from 'display_url' in the pandas 'result' dataframe and store them with their respective shortcode as the file name.

(Important note: Remember that you should respect authors' rights when you download copyrighted content. Do not use images/videos from Instagram for commercial purposes.)

import os
import requests

result.index = range(len(result.index))
directory = "/directory/you/want/to/save/images/"
for i in range(len(result)):
    r = requests.get(result['display_url'][i])
    with open(directory + result['shortcode'][i] + ".jpg", 'wb') as f:
        f.write(r.content)
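A slightly more defensive variant of the same loop; the status-code check, timeout and per-request pause are assumptions, not part of the original article:

import time
import requests

for i in range(len(result)):
    r = requests.get(result['display_url'][i], timeout=10)
    if r.status_code == 200:  # only save responses that actually returned content
        with open(directory + result['shortcode'][i] + ".jpg", 'wb') as f:
            f.write(r.content)
    time.sleep(1)  # be polite and avoid hammering the server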

Thanks for reading and I hope you find this article useful. If you
have any questions, I’d be more than happy to discuss.
