Sma 2

Name: Karthik Iyer, Anusha Maniar, Darsh Shah, Hrimkar Doshi

UID:
2021700028, 2021700038, 2021700057, 2021700021

Experiment No. 2

PROBLEM STATEMENT:
Data Collection: Select the social media platforms of your choice (Twitter, Facebook, LinkedIn,
YouTube, web blogs, etc.), connect to them, and capture social media data for business (scraping,
crawling, parsing).

THEORY:
1. Scraping

Theory: Scraping, or web scraping, refers to the automated process of extracting data from
websites. This technique involves sending requests to a website and parsing the HTML content
to retrieve specific pieces of information.

Example: Suppose you want to collect the latest headlines from a news website. You can use a
scraping tool to visit the site, parse the HTML, and extract the text of the headlines.

Working:

● Request: A request is sent to the website's server using libraries like requests in
Python.
● Response: The server returns the HTML content of the page.
● Parsing: Libraries like BeautifulSoup or lxml are used to parse the HTML and
extract the desired data, such as tags containing headlines, prices, or other elements.
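
The request, response, and parsing steps above can be sketched as follows. This is a minimal illustration, not the method used in this experiment: to stay self-contained it swaps in Python's standard-library html.parser for BeautifulSoup or lxml, and it parses a hard-coded HTML string (SAMPLE_HTML, a hypothetical stand-in for what requests.get(url).text would return). The headlines and the choice of h2 tags are assumptions made for the example.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for the body of an HTTP response.
# In a live run this string would come from something like requests.get(url).text.
SAMPLE_HTML = """
<html><body>
  <h2 class="headline">Markets rally after rate cut</h2>
  <p>Some article text.</p>
  <h2 class="headline">New phone launch draws crowds</h2>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collects the text found inside every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_headline = True
            self.headlines.append("")  # start a new headline buffer

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines[-1] += data


def extract_headlines(html):
    """Parse the HTML and return the stripped text of each <h2>."""
    parser = HeadlineParser()
    parser.feed(html)
    return [h.strip() for h in parser.headlines]


headlines = extract_headlines(SAMPLE_HTML)
print(headlines)
```

With BeautifulSoup installed, the same extraction would typically be a call such as soup.find_all("h2") on the parsed page.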

2. Crawling

Theory: Crawling is the process of systematically navigating through multiple pages of a
website to collect data. Crawlers, also known as spiders, start from a set of URLs (known as
seeds) and follow links on those pages to discover and retrieve new content.

Example: A search engine like Google uses crawlers to visit billions of web pages, following
links from one page to another to index the content for search results.

Working:

● Seed URLs: The crawler starts with a list of URLs.
● Link Extraction: It visits each URL, extracts links from the page, and adds them to the
list of URLs to visit.
● Data Collection: As it visits each page, the crawler collects data, such as text, images,
or metadata, and stores it in a database.
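
The seed, link-extraction, and collection loop above can be sketched as a breadth-first traversal. This is an illustrative sketch under stated assumptions: SITE is a hypothetical in-memory "website" (a real crawler would fetch each page over HTTP and should respect robots.txt and rate limits), and the crawl function and LinkParser class are names introduced only for this example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": URL path -> HTML content.
# A real crawler would download each page instead of looking it up here.
SITE = {
    "/": '<a href="/products">Products</a> <a href="/about">About</a>',
    "/products": '<a href="/products/1">Item 1</a> <a href="/">Home</a>',
    "/products/1": "<p>Item 1 details</p>",
    "/about": '<a href="/">Home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds):
    """Visit pages breadth-first from the seeds; return URLs in visit order."""
    queue = deque(seeds)
    seen = set(seeds)      # URLs already queued, so no page is visited twice
    visited = []
    while queue:
        url = queue.popleft()
        html = SITE.get(url)
        if html is None:   # unknown page: nothing to collect
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl(["/"]))
```

The seen set is what keeps the crawler from looping forever when pages link back to each other, as "/products" and "/about" do here.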

3. Parsing

Theory: Parsing is the process of analyzing a string of symbols, either in natural language, code,
or data formats, to understand its structure and extract meaningful information. In web data
collection, parsing typically refers to the extraction and processing of data from HTML, XML,
or JSON formats.

Example: When you receive JSON data from an API, you need to parse it to extract specific
fields, such as a video title or number of views from YouTube's API response.

Working:

● HTML Parsing: Tools like BeautifulSoup are used to parse HTML and extract
specific tags or attributes.
● JSON Parsing: JSON data can be parsed using native methods in most programming
languages. In Python, json.loads() is used to convert a JSON string into a
dictionary.
● XML Parsing: Similar to HTML, XML data can be parsed to extract structured
information, using libraries like ElementTree.
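
As a small illustration of the JSON and XML cases above, the sketch below parses a trimmed-down JSON string and a short XML string. The JSON fields mirror the YouTube response shown in the OUTPUT section of this report; the XML document is a hypothetical example made up for illustration, since the YouTube API used here returns JSON.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed-down JSON in the shape of a YouTube videos.list response.
json_text = """
{
  "items": [
    {
      "snippet": {"title": "This is how I manage the pressure of being Cristiano"},
      "statistics": {"viewCount": "3835717"}
    }
  ]
}
"""

data = json.loads(json_text)              # JSON string -> Python dict
video = data["items"][0]
title = video["snippet"]["title"]
views = int(video["statistics"]["viewCount"])  # the API returns counts as strings

# Hypothetical XML document carrying similar information.
xml_text = "<videos><video id='hKR57pX7-fY'><title>Sample</title></video></videos>"

root = ET.fromstring(xml_text)            # XML string -> element tree
first_video = root.find("video")
xml_id = first_video.get("id")            # read an attribute
xml_title = first_video.findtext("title") # read a child element's text

print(title, views)
print(xml_id, xml_title)
```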

Practical Use Cases in Business:

● Scraping: Collect pricing information from competitors' websites to adjust your pricing
strategy.
● Crawling: Gather product reviews from multiple e-commerce sites to analyze customer
sentiment.
● Parsing: Extract and analyze data from an API to monitor real-time changes in market
trends.

CODE:
import json
import os

import requests

# YouTube Data API v3 endpoint for video resources
api_url = 'https://www.googleapis.com/youtube/v3/videos'

# Request multiple parts: snippet, contentDetails, statistics, and status
params = {
    'part': 'snippet,contentDetails,statistics,status',
    'id': 'hKR57pX7-fY',
    # The API key is read from the environment instead of being hard-coded.
    # YOUTUBE_API_KEY is an assumed variable name; set it before running.
    'key': os.environ.get('YOUTUBE_API_KEY', ''),
}

response = requests.get(api_url, params=params, timeout=10)
response.raise_for_status()  # stop early on HTTP errors (e.g. invalid key)
data = response.json()

# Print all the data retrieved for inspection
print(json.dumps(data, indent=4))

if 'items' in data and len(data['items']) > 0:
    video_data = data['items'][0]

    # Extract fields from each requested part
    title = video_data['snippet']['title']
    description = video_data['snippet']['description']
    published_at = video_data['snippet']['publishedAt']
    channel_title = video_data['snippet']['channelTitle']
    tags = video_data['snippet'].get('tags', [])

    duration = video_data['contentDetails']['duration']
    definition = video_data['contentDetails']['definition']
    caption_status = video_data['contentDetails']['caption']

    views = video_data['statistics'].get('viewCount')
    likes = video_data['statistics'].get('likeCount')
    # Dislike count might not be available anymore, so fall back to 'N/A'
    dislikes = video_data['statistics'].get('dislikeCount', 'N/A')
    comment_count = video_data['statistics'].get('commentCount')

    privacy_status = video_data['status']['privacyStatus']
    license_status = video_data['status']['license']

    # Print out all the collected data
    print(f"Title: {title}")
    print(f"Description: {description}")
    print(f"Published At: {published_at}")
    print(f"Channel Title: {channel_title}")
    print(f"Tags: {tags}")
    print(f"Duration: {duration}")
    print(f"Definition: {definition}")
    print(f"Caption Status: {caption_status}")
    print(f"Views: {views}")
    print(f"Likes: {likes}")
    print(f"Dislikes: {dislikes}")
    print(f"Comment Count: {comment_count}")
    print(f"Privacy Status: {privacy_status}")
    print(f"License Status: {license_status}")
else:
    print("No video data found or invalid video ID.")

OUTPUT:
{
    "kind": "youtube#videoListResponse",
    "etag": "fW5nFjrg_8GRFUUSXOCOQIzthQ0",
    "items": [
        {
            "kind": "youtube#video",
            "etag": "9PKFxU5RVlM5ljTxq7JXZPdyu_Q",
            "id": "hKR57pX7-fY",
            "snippet": {
                "publishedAt": "2024-08-21T13:00:58Z",
                "channelId": "UCtxD0x6AuNNqdXO9Wp5GHew",
                "title": "This is how I manage the pressure of being Cristiano",
                "description": "Cristiano explains what it's like to be the most watched person in the world and what that responsibility entails",
                "thumbnails": {
                    "default": {
                        "url": "https://i.ytimg.com/vi/hKR57pX7-fY/default.jpg",
                        "width": 120,
                        "height": 90
                    },
                    "medium": {
                        "url": "https://i.ytimg.com/vi/hKR57pX7-fY/mqdefault.jpg",
                        "width": 320,
                        "height": 180
                    },
                    "high": {
                        "url": "https://i.ytimg.com/vi/hKR57pX7-fY/hqdefault.jpg",
                        "width": 480,
                        "height": 360
                    },
                    "standard": {
                        "url": "https://i.ytimg.com/vi/hKR57pX7-fY/sddefault.jpg",
                        "width": 640,
                        "height": 480
                    },
                    "maxres": {
                        "url": "https://i.ytimg.com/vi/hKR57pX7-fY/maxresdefault.jpg",
                        "width": 1280,
                        "height": 720
                    }
                },
                "channelTitle": "UR \u00b7 Cristiano",
                "categoryId": "22",
                "liveBroadcastContent": "none",
                "localized": {
                    "title": "This is how I manage the pressure of being Cristiano",
                    "description": "Cristiano explains what it's like to be the most watched person in the world and what that responsibility entails"
                },
                "defaultAudioLanguage": "pt"
            },
            "contentDetails": {
                "duration": "PT1M9S",
                "dimension": "2d",
                "definition": "hd",
                "caption": "true",
                "licensedContent": true,
                "contentRating": {},
                "projection": "rectangular"
            },
            "status": {
                "uploadStatus": "processed",
                "privacyStatus": "public",
                "license": "youtube",
                "embeddable": true,
                "publicStatsViewable": true,
                "madeForKids": false
            },
            "statistics": {
                "viewCount": "3835717",
                "likeCount": "643439",
                "favoriteCount": "0",
                "commentCount": "20451"
            }
        }
    ],
    "pageInfo": {
        "totalResults": 1,
        "resultsPerPage": 1
    }
}
Title: This is how I manage the pressure of being Cristiano
Description: Cristiano explains what it's like to be the most watched person in the world and what that responsibility entails
Published At: 2024-08-21T13:00:58Z
Channel Title: UR · Cristiano
Tags: []
Duration: PT1M9S
Definition: hd
Caption Status: true
Views: 3835717
Likes: 643439
Dislikes: N/A
Comment Count: 20451
Privacy Status: public
License Status: youtube
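
The values printed above are still raw strings: the duration is in ISO 8601 form (PT1M9S) and the counts are text. A possible post-processing step for business analysis is sketched below. The helper names are hypothetical, the duration parser only handles the time-only PT...H...M...S form (an assumption; longer formats with days are not covered), and "engagement rate" is defined here simply as likes divided by views, which is one of several possible definitions.

```python
import re


def iso8601_duration_to_seconds(duration):
    """Convert a time-only ISO 8601 duration such as 'PT1M9S' to seconds."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if match is None:
        raise ValueError(f"Unsupported duration format: {duration!r}")
    # Missing components (e.g. no hours) are treated as zero
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def engagement_rate(likes, views):
    """Likes divided by views; the API returns both counts as strings."""
    likes, views = int(likes), int(views)
    return likes / views if views else 0.0


# Values taken from the output above
seconds = iso8601_duration_to_seconds("PT1M9S")
rate = engagement_rate("643439", "3835717")

print(f"Duration in seconds: {seconds}")
print(f"Likes per view: {rate:.4f}")
```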

CONCLUSION:
The experiment demonstrated the process of collecting social media data for business analysis,
using the YouTube Data API to request and parse a video's metadata and statistics. Despite
technical challenges, the ability to capture and analyze diverse data types provides valuable
insights for informed decision-making and strategic planning.
