Sma 2

Name: Karthik Iyer, Anusha Maniar, Darsh Shah, Hrimkar Doshi
UID: 2021700028, 2021700038, 2021700057, 2021700021
Experiment No. 2
PROBLEM STATEMENT:
Data Collection: Select the social media platforms of your choice (Twitter, Facebook, LinkedIn,
YouTube, web blogs, etc.), connect to them, and capture social media data for business (scraping,
crawling, parsing).
THEORY:
1. Scraping
Theory: Scraping, or web scraping, refers to the automated process of extracting data from
websites. This technique involves sending requests to a website and parsing the HTML content
to retrieve specific pieces of information.
Example: Suppose you want to collect the latest headlines from a news website. You can use a
scraping tool to visit the site, parse the HTML, and extract the text of the headlines.
Working:
● Request: A request is sent to the website's server using libraries like requests in
Python.
● Response: The server returns the HTML content of the page.
● Parsing: Libraries like BeautifulSoup or lxml are used to parse the HTML and
extract the desired data, such as tags containing headlines, prices, or other elements.
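The request, response, and parsing steps above can be sketched as follows. This is a minimal illustration, not the method used in this experiment: it uses only the Python standard library (urllib and html.parser) in place of requests and BeautifulSoup so that it runs without extra installs, and the class name HeadlineParser, the helper names, the User-Agent string, and the sample HTML with its "headline" CSS class are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class HeadlineParser(HTMLParser):
    """Collects the text of every <h2> element whose class list contains 'headline'."""

    def __init__(self):
        super().__init__()
        self._in_headline = False
        self._buffer = []
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "headline" in classes:
            self._in_headline = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_headline:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_headline:
            self.headlines.append("".join(self._buffer).strip())
            self._in_headline = False


def fetch_html(url):
    """Request and response steps: download a page and decode it as text."""
    request = Request(url, headers={"User-Agent": "sma-lab-example/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_headlines(html):
    """Parsing step: return the headline texts found in an HTML string."""
    parser = HeadlineParser()
    parser.feed(html)
    return parser.headlines


# Hypothetical page content, used here instead of a live request.
sample_html = """
<html><body>
  <h2 class="headline">Markets rally on rate cut</h2>
  <p>Body text that is not a headline.</p>
  <h2 class="headline">New phone <em>launch</em> announced</h2>
  <h2 class="other">Sidebar title</h2>
</body></html>
"""
headlines = extract_headlines(sample_html)
print(headlines)
```

For a live page, extract_headlines(fetch_html(url)) would combine the three steps, provided the target site's markup and terms of use allow it.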
2. Crawling
Theory: Crawling is the process of systematically navigating through multiple pages of a
website to collect data. Crawlers, also known as spiders, start from a set of URLs (known as
seeds) and follow links on those pages to discover and retrieve new content.
Example: A search engine like Google uses crawlers to visit billions of web pages, following
links from one page to another to index the content for search results.
Working:
● Seed URLs: The crawler starts with a list of URLs.
● Link Extraction: It visits each URL, extracts links from the page, and adds them to the
list of URLs to visit.
● Data Collection: As it visits each page, the crawler collects data, such as text, images,
or metadata, and stores it in a database.
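The seed, link-extraction, and data-collection steps above might look like the sketch below. It is a breadth-first crawler under stated assumptions: the fetch function is passed in as a parameter, and the example feeds it a hypothetical in-memory site (fake_site) rather than the network, so the names crawl, LinkParser, and fake_site and the example.com URLs are illustrative only.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds, fetch, max_pages=50):
    """Visit pages breadth-first from the seed URLs and return {url: html}."""
    queue = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)      # URLs already queued, to avoid revisiting
    pages = {}             # collected data, keyed by URL
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:   # page could not be retrieved; skip it
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical four-page site standing in for real HTTP responses.
fake_site = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "https://example.com/b": '<a href="/c#top">C</a>',
    "https://example.com/c": "<p>leaf page</p>",
}
pages = crawl(["https://example.com/"], fake_site.get)
visited_order = list(pages)
print(visited_order)
```

A real crawler would also need to respect robots.txt, rate limits, and a domain allow-list; those are left out to keep the sketch short.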
3. Parsing
Theory: Parsing is the process of analyzing a string of symbols, either in natural language, code,
or data formats, to understand its structure and extract meaningful information. In web data
collection, parsing typically refers to the extraction and processing of data from HTML, XML,
or JSON formats.
Example: When you receive JSON data from an API, you need to parse it to extract specific
fields, such as a video title or number of views from YouTube's API response.
Working:
● HTML Parsing: Tools like BeautifulSoup are used to parse HTML and extract
specific tags or attributes.
● JSON Parsing: JSON data can be parsed using native methods in most programming
languages. In Python, json.loads() is used to convert a JSON string into a
dictionary.
● XML Parsing: Similar to HTML, XML data can be parsed to extract structured
information, using libraries like ElementTree.
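As a small illustration of the JSON and XML cases above, the sketch below parses two short strings with json.loads() and ElementTree. The JSON string is a trimmed-down stand-in shaped like the YouTube response recorded in this report, and the XML feed is a hypothetical example; neither is a complete API payload.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed sample shaped like a YouTube videos.list response.
json_text = """
{
    "items": [
        {
            "snippet": {"title": "This is how I manage the pressure of being Cristiano"},
            "statistics": {"viewCount": "3835717"}
        }
    ]
}
"""
payload = json.loads(json_text)          # JSON string -> Python dictionary
first_item = payload["items"][0]
video_title = first_item["snippet"]["title"]
# The API returns counts as strings, so convert before doing arithmetic.
view_count = int(first_item["statistics"]["viewCount"])

# Hypothetical XML feed, parsed with the standard-library ElementTree module.
xml_text = """
<feed>
    <entry><title>First post</title></entry>
    <entry><title>Second post</title></entry>
</feed>
"""
root = ET.fromstring(xml_text)
entry_titles = [entry.findtext("title") for entry in root.findall("entry")]

print(video_title, view_count)
print(entry_titles)
```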
Practical Use Case in Business:
● Scraping: Collect pricing information from competitors' websites to adjust your pricing
strategy.
● Crawling: Gather product reviews from multiple e-commerce sites to analyze customer
sentiment.
● Parsing: Extract and analyze data from an API to monitor real-time changes in market
trends.
CODE:
import json

import requests

# YouTube Data API v3 endpoint for video resources
api_url = 'https://www.googleapis.com/youtube/v3/videos'

# Request multiple parts: snippet, contentDetails, statistics, and status
params = {
    'part': 'snippet,contentDetails,statistics,status',
    'id': 'hKR57pX7-fY',
    'key': 'YOUR_API_KEY',  # replace with your own API key; never publish a real key
}

response = requests.get(api_url, params=params)
data = response.json()

# Print all the data retrieved for inspection
print(json.dumps(data, indent=4))

if 'items' in data and len(data['items']) > 0:
    video_data = data['items'][0]

    # Extract various data points
    title = video_data['snippet']['title']
    description = video_data['snippet']['description']
    published_at = video_data['snippet']['publishedAt']
    channel_title = video_data['snippet']['channelTitle']
    tags = video_data['snippet'].get('tags', [])
    duration = video_data['contentDetails']['duration']
    definition = video_data['contentDetails']['definition']
    caption_status = video_data['contentDetails']['caption']
    views = video_data['statistics'].get('viewCount')
    likes = video_data['statistics'].get('likeCount')
    # Dislike count might not be available anymore
    dislikes = video_data['statistics'].get('dislikeCount', 'N/A')
    comment_count = video_data['statistics'].get('commentCount')
    privacy_status = video_data['status']['privacyStatus']
    license_status = video_data['status']['license']

    # Print out all the collected data
    print(f"Title: {title}")
    print(f"Description: {description}")
    print(f"Published At: {published_at}")
    print(f"Channel Title: {channel_title}")
    print(f"Tags: {tags}")
    print(f"Duration: {duration}")
    print(f"Definition: {definition}")
    print(f"Caption Status: {caption_status}")
    print(f"Views: {views}")
    print(f"Likes: {likes}")
    print(f"Dislikes: {dislikes}")
    print(f"Comment Count: {comment_count}")
    print(f"Privacy Status: {privacy_status}")
    print(f"License Status: {license_status}")
else:
    print("No video data found or invalid video ID.")
OUTPUT:
{
    "kind": "youtube#videoListResponse",
    "etag": "fW5nFjrg_8GRFUUSXOCOQIzthQ0",
    "items": [
        {
            "kind": "youtube#video",
            "etag": "9PKFxU5RVlM5ljTxq7JXZPdyu_Q",
            "id": "hKR57pX7-fY",
            "snippet": {
                "publishedAt": "2024-08-21T13:00:58Z",
                "channelId": "UCtxD0x6AuNNqdXO9Wp5GHew",
                "title": "This is how I manage the pressure of being Cristiano",
                "description": "Cristiano explains what it's like to be the most watched person in the world and what that responsibility entails",
                "thumbnails": {
                    "default": {
                        "url": "https://i.ytimg.com/vi/hKR57pX7-fY/default.jpg",
                        "width": 120,
                        "height": 90
                    },
                    "medium": {
                        "url": "https://i.ytimg.com/vi/hKR57pX7-fY/mqdefault.jpg",
                        "width": 320,
                        "height": 180
                    },
                    "high": {
                        "url": "https://i.ytimg.com/vi/hKR57pX7-fY/hqdefault.jpg",
                        "width": 480,
                        "height": 360
                    },
                    "standard": {
                        "url": "https://i.ytimg.com/vi/hKR57pX7-fY/sddefault.jpg",
                        "width": 640,
                        "height": 480
                    },
                    "maxres": {
                        "url": "https://i.ytimg.com/vi/hKR57pX7-fY/maxresdefault.jpg",
                        "width": 1280,
                        "height": 720
                    }
                },
                "channelTitle": "UR \u00b7 Cristiano",
                "categoryId": "22",
                "liveBroadcastContent": "none",
                "localized": {
                    "title": "This is how I manage the pressure of being Cristiano",
                    "description": "Cristiano explains what it's like to be the most watched person in the world and what that responsibility entails"
                },
                "defaultAudioLanguage": "pt"
            },
            "contentDetails": {
                "duration": "PT1M9S",
                "dimension": "2d",
                "definition": "hd",
                "caption": "true",
                "licensedContent": true,
                "contentRating": {},
                "projection": "rectangular"
            },
            "status": {
                "uploadStatus": "processed",
                "privacyStatus": "public",
                "license": "youtube",
                "embeddable": true,
                "publicStatsViewable": true,
                "madeForKids": false
            },
            "statistics": {
                "viewCount": "3835717",
                "likeCount": "643439",
                "favoriteCount": "0",
                "commentCount": "20451"
            }
        }
    ],
    "pageInfo": {
        "totalResults": 1,
        "resultsPerPage": 1
    }
}
Title: This is how I manage the pressure of being Cristiano
Description: Cristiano explains what it's like to be the most watched person in the world and what that responsibility entails
Published At: 2024-08-21T13:00:58Z
Channel Title: UR · Cristiano
Tags: []
Duration: PT1M9S
Definition: hd
Caption Status: true
Views: 3835717
Likes: 643439
Dislikes: N/A
Comment Count: 20451
Privacy Status: public
License Status: youtube
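The values printed above are still raw strings: the counts arrive as text and the duration is in ISO 8601 form (PT1M9S). Before using them in business analysis they would normally be converted to numbers. The sketch below shows one possible way to do that; the helper name parse_iso8601_duration and the likes-per-1000-views figure are illustrative assumptions rather than part of the experiment, and the helper handles only hours, minutes, and seconds.

```python
import re


def parse_iso8601_duration(duration):
    """Convert a duration such as 'PT1M9S' to seconds (hours, minutes, seconds only)."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if match is None:
        raise ValueError(f"Unsupported duration format: {duration!r}")
    hours, minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the output above; the API returns counts as strings.
statistics = {"viewCount": "3835717", "likeCount": "643439", "commentCount": "20451"}

views = int(statistics["viewCount"])
likes = int(statistics["likeCount"])
comments = int(statistics["commentCount"])
duration_seconds = parse_iso8601_duration("PT1M9S")

# One simple derived figure: likes per 1000 views.
likes_per_1000_views = round(1000 * likes / views, 1)

print(f"Duration: {duration_seconds} seconds")
print(f"Views: {views}, Likes: {likes}, Comments: {comments}")
print(f"Likes per 1000 views: {likes_per_1000_views}")
```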
CONCLUSION:
The experiment effectively demonstrated the process of collecting data from various social
media platforms for business analysis. Despite technical challenges, the ability to capture and
analyze diverse data types provides valuable insights for informed decision-making and strategic
planning.
