
EVEREST ENGINEERING COLLEGE

(AFFILIATED TO POKHARA UNIVERSITY)

Minor Project Report

on

SENTIMENTAL ANALYSIS OF YOUTUBE VIDEO COMMENTS

Submitted By:

Amir Arshad Khan 21120057


Dipendra Raj Bhatt 21120077
Prabesh Shrestha 21120103
Prajwal Acharya 21120104

Submitted To:

Department of Computer and IT Engineering

Everest Engineering College

Sanepa-2 Lalitpur

23 September, 2024
DECLARATION

We hereby declare that the report of the project entitled “Sentimental Analysis of
YouTube Video Comments” which is being submitted to the Department of Computer
and Information Technology Engineering, Everest Engineering College, Sanepa-2
Lalitpur, in the partial fulfillment of the requirements for the award of the Degree of
Bachelor of Engineering in Computer Engineering, is a bona fide report of the work carried
out by us. We are responsible for the work submitted in this project; the original work is
our own except as specified in the references and acknowledgements, and the work
contained herein has not been undertaken or done by unspecified sources or persons.

Amir Arshad Khan [21120057]


Dipendra Raj Bhatt [21120077]
Prabesh Shrestha [21120103]
Prajwal Acharya [21120104]
CERTIFICATE OF APPROVAL

The project report entitled “Sentimental Analysis of YouTube Video Comments”,


submitted by Amir Arshad Khan, Dipendra Raj Bhatt, Prabesh Shrestha and Prajwal
Acharya in partial fulfillment of the requirement for the Bachelor’s degree in Information
Technology Engineering has been accepted as a Bonafede record of work independently
carried out by the group in the department.

……………………………..

External Examiner
Name: ……………………………….
Designation: …………………………
Affiliation: ………………………….
……………………………………….
PROJECT COMMITTEE
Department of Computer and IT Engineering
Everest Engineering College, Sanepa
COPYRIGHT

The author has agreed that the library, Everest Engineering College (EEC), Sanepa-2,
Lalitpur may make this report freely available for inspection. Moreover, the author has
agreed that permission for extensive copying of this project report for scholarly purposes
may be granted by the lecturers who supervised the project work recorded herein or, in
their absence, by the Head of the Department wherein the project report was done. It is
understood that due recognition will be given to the author of the report and to the
Department of Computer and Information Technology Engineering, EEC, in any use of the material of this
project report. Copying or publication or other use of this report for financial gain without
approval of the Department and author’s written permission is prohibited. Request for
permission to copy or to make any other use of the material in this report in whole or in
part should be addressed to the Head of Department, Department of Computer and
Information Technology Engineering.
ACKNOWLEDGEMENT

The success of this project required a great deal of guidance and assistance from many
people, and we are extremely fortunate to have received it throughout the proposal defense
of our third-year project work. Whatever we have accomplished is due to such guidance and
assistance. Firstly, we would like to thank Pokhara University for including the third-year
project as a part of our curriculum. We would also like to express our sincere gratitude to
our project committee members for their guidance and support throughout the development
of this project; their expertise and insights were invaluable in selecting and creating the
system.

ABSTRACT

This project focuses on the sentiment analysis of YouTube video comments to enhance
video recommendations and improve user experience. YouTube's vast and diverse content
presents a challenge in maintaining quality and relevance. To address this, sentiment
analysis is employed to classify comments as positive, negative, or neutral, providing
valuable feedback for categorizing videos and refining the recommendation system. The
analysis utilizes both Naive Bayes and Logistic Regression classifiers. Specifically, the
Multinomial Naive Bayes Classifier, known for its effectiveness in text classification, was
implemented alongside Unigram features to capture the context and sentiment of
comments. This approach involved preprocessing text data through tokenization, stop-
word removal, and stemming to prepare it for classification. The Naive Bayes classifier,
trained with Unigram features, was used to predict the sentiment of comments, while
Logistic Regression was applied to further validate the results. By accurately identifying
viewer sentiment, the system improves the recommendation engine by promoting videos
with positive feedback and filtering out those with negative responses. This benefits both
content creators, who gain insights into audience perception, and viewers, who receive
more relevant and engaging content. Ultimately, the project enhances user satisfaction by
ensuring a more personalized and trustworthy platform experience.

Keywords: Sentiment Analysis, YouTube Comments, Naive Bayes Classifier, Logistic
Regression, Multinomial Naive Bayes, Unigram, Text Classification, Viewer Feedback,
Content Quality, Natural Language Processing (NLP).

Table of Contents
ACKNOWLEDGEMENT ................................................................................................... i
ABSTRACT ...................................................................................................................... ii
List of Figures .................................................................................................................... iv
List of Abbreviations .......................................................................................... v
Chapter 1 INTRODUCTION....................................................................................... 1
1.1 Introduction .......................................................................................................... 1
1.2 Problem Statement ............................................................................................... 2
1.3 Objectives ............................................................................................................. 2
1.4 Scope and Applications ........................................................................................ 3
Chapter 2 LITERATURE REVIEW .......................................................................... 4
Chapter 3 METHODOLOGY ..................................................................................... 7
3.1 System Block Diagram......................................................................................... 7
3.2 Hardware and software required .......................................................................... 9
3.3 Data Collection ..................................................................................................... 9
3.4 Pre-processing ..................................................................................................... 10
3.5 Feature Extraction .............................................................................................. 12
3.6 Model Implementation ....................................................................................... 13
3.7 Evaluating and Tuning the Model ...................................................................... 15
3.8 Presentation of Output ........................................................................................ 16
3.9 Implementation Plan .......................................................................................... 17
3.10 Problems Faced .................................................................................................. 19
Chapter 4 RESULT AND ANALYSIS ...................................................................... 21
4.1 Results ................................................................................................................ 21
4.2 Performance Metrics .......................................................................................... 23
Chapter 5 CONCLUSION ......................................................................................... 25
REFERENCES ................................................................................................................ 26
APPENDIX ..................................................................................................................... 27

List of Figures

Figure 3-1 System Flow Diagram .........................................................................................................7


Figure 3-2 Sentiment analysis implementation.................................................................................8

List of Abbreviations

IDF Inverse Document Frequency


NLP Natural Language Processing
RNNs Recurrent Neural Networks
SVM Support Vector Machine
TF Term Frequency

Chapter 1 INTRODUCTION

1.1 Introduction
YouTube is one of the most popular platforms for sharing and viewing videos, with over 2
billion active users every month. It offers content ranging from tutorials to entertainment,
and its open platform allows anyone to upload videos and share their ideas with the world.
However, maintaining the quality and reliability of content is a challenge due to the vast
amount of content being uploaded.

This project focuses on analyzing YouTube comments using sentiment analysis to


categorize videos based on viewer feedback. Sentiment analysis, a technique from NLP,
identifies whether the sentiment behind a comment is positive, negative, or neutral. By
applying sentiment analysis, YouTube can improve its recommendation system and
address some of the platform's content quality issues.

Sentiment analysis has evolved over the years, starting in the early 2000s with research
into understanding emotional tones in product reviews and online discussions. As machine
learning and NLP techniques developed, sentiment analysis became more sophisticated
and is now used across various platforms, including social media, product feedback, and
video comments.

Sentiment analysis scans comments to identify negative reactions. When a video receives
a large number of negative comments, it may indicate that the video is misleading or of
low quality. Based on this, YouTube can adjust its recommendation algorithm to stop
promoting such videos. For example, if comments frequently mention terms like
"clickbait" or "disappointing," sentiment analysis can flag the video as untrustworthy.
Sentiment analysis can detect patterns in spam or irrelevant comments by analyzing
language use and tone. Spam comments that repeat generic phrases or promote unrelated
content can be identified and filtered out. For instance, comments like "Great video!"
posted repetitively across many unrelated videos could be marked as spam. By
understanding the emotional tone of comments (whether they are positive, negative, or
neutral), sentiment analysis can improve YouTube’s recommendation system. Videos with
mostly positive comments can be promoted, while those with mixed or negative feedback
can be demoted, helping users discover more relevant content.

For example, videos with comments such as "Loved this video!" or "Very helpful" indicate
high engagement and quality, making them more suitable for recommendation.

Thus, this project accurately captures the sentiments of viewers regarding YouTube video
content, providing valuable insights for both content creators and new viewers. It helps
creators understand what type of content resonates well with the audience, while also
guiding viewers towards higher-quality videos. This ensures clarity on the kinds of content
that should be created and consumed, ultimately benefiting the entire YouTube community.

1.2 Problem Statement


With the vast amount of user-generated content on YouTube, understanding viewer
opinions and reactions has become a significant challenge. Traditional metrics such as
views, likes, and dislikes provide limited insight into the true sentiments of viewers. This
lack of in-depth understanding hampers content creators, advertisers, and platform
moderators in making informed decisions. Negative sentiments often go unnoticed, which
can lead to the spread of misleading information, inappropriate content, or reduced
engagement.

The problem is to develop an effective system that can accurately analyze and categorize
the sentiments expressed in YouTube video comments as positive, negative, or neutral.
This system should be able to process large volumes of data efficiently, providing real-
time insights to help creators tailor their content strategies, improve audience engagement,
and assist moderators in identifying and addressing potential issues within the community.
Ultimately, this project aims to enhance the overall quality of content and interactions on
the YouTube platform through data-driven sentiment analysis.

1.3 Objectives
To analyze the sentiment of YouTube video comments and categorize them as positive or
negative.

1.4 Scope and Applications
This project focuses on analyzing viewer sentiments in YouTube video comments. By
categorizing comments as positive, negative, or neutral using sentiment analysis
techniques, it provides insights into how videos are perceived by the audience. The system
efficiently processes large volumes of comments, helping content creators understand
viewer feedback and improve their content strategies. Additionally, it can identify videos
with predominantly negative sentiments, which may indicate low-quality or misleading
content, allowing for better moderation and content management.

The application of this project spans several areas on the YouTube platform. For content
creators, sentiment analysis provides a clearer picture of how their content is received,
allowing them to adjust their approach based on feedback. For viewers, it enhances the
recommendation engine by promoting videos with positive feedback and reducing
exposure to negatively reviewed content. This project also benefits advertisers, as it enables
better ad targeting by analyzing viewer sentiments to match ads with appropriate content.
Moreover, it can be integrated into YouTube’s moderation system to detect and filter out
spam, misinformation, and irrelevant comments, ensuring a cleaner and more trustworthy
interaction environment.

Chapter 2 LITERATURE REVIEW

Text classification is a widely used task in NLP, with many applications including
categorizing customer feedback and filtering spam comments. To develop sentiment
analysis, a large dataset containing positive and negative sentiments is used, and the goal
is to separate the comments based on these labels. Dataset objects offer extensive
functionality for manipulating data, but it can be useful to convert them to a pandas
DataFrame for easier data visualization.

Sentiment analysis, also known as opinion mining, is a crucial component of natural


language processing (NLP) that involves determining the emotional tone behind a series of
words. Its importance lies in its ability to extract and understand subjective information
from text, which can be instrumental in various applications. In the context of YouTube
comments, sentiment analysis becomes particularly valuable due to the vast amount of
user-generated content and its implications for content creators, marketers, and platform
administrators [1].

YouTube comments are a rich source of feedback and opinions on videos, products, and
services. Analyzing these comments allows for insights into user sentiments, trends, and
reactions to content. This information is essential for improving content strategies,
moderating harmful interactions, and enhancing user engagement. Given the sheer volume
of comments, manual analysis is impractical, making automated sentiment analysis a
necessity for deriving actionable insights. Sentiment analysis encompasses a range of
methodologies, which can be broadly categorized into traditional rule-based systems and
modern machine learning (ML) and deep learning approaches. Rule-Based Systems:
Traditional sentiment analysis methods rely on predefined rules and lexicons to evaluate
sentiment. These systems use lists of words associated with positive or negative sentiments
and apply grammatical rules to infer the overall sentiment of a text. While straightforward,
rule-based systems often struggle with nuances such as context and sarcasm.

Machine Learning Approaches: Modern sentiment analysis frequently employs machine


learning techniques, which involve training algorithms on labeled datasets to predict
sentiment. Naive Bayes is a probabilistic classifier based on Bayes' theorem, which
assumes feature independence. Despite its simplicity, Naive Bayes performs well in text
classification tasks [2].

Previous work by Siersdorfer et al. on YouTube analyzed over 6 million comments from
67,000 videos. They built prediction models to forecast the rating of new comments based
on previous user interactions. Pang, Lee, and Vaithyanathan performed sentiment analysis
on movie reviews, using ML algorithms such as Naïve Bayes and Support Vector Machines
(SVM). They showed that machine learning outperforms manual classification, but
sentiment analysis remains a complex problem due to conflicting positive and negative
expressions in the text. Smita Shree and Josh Brolin adopted a lexicon-based approach to
identify sentiment polarity in YouTube comments. However, their study revealed a lower
recall rate for negative sentiments due to linguistic variations in expressing dissatisfaction.
Other works, such as those by A. Kowcika et al., demonstrated how sentiment analysis on
Twitter could reveal a correlation between individual moods and real-world events. The
authors utilized features such as lemmatization, tokenization, n-grams, and stop word
removal to enhance the model's performance. By applying these techniques, the system
improved the accuracy of sentiment classification [3].

Sentiment analysis has been traditionally applied to various social media platforms such as
Twitter, Facebook, and blogs. YouTube, being one of the largest video-sharing platforms
globally, offers a massive dataset of comments reflecting users’ opinions on diverse topics.
The sentiment analysis of YouTube video comments presents unique challenges due to the
informal, unstructured, and often noisy nature of the data. Several research efforts have
been made to explore sentiment in YouTube comments to understand viewer reactions,
brand perception, and user engagement. Studies have demonstrated the use of machine
learning and natural language processing (NLP) techniques to classify and extract
sentiments from these comments. One of the primary algorithms used is the Naive Bayes
classifier due to its simplicity and effectiveness in text classification tasks. The article by
Analytics Vidhya highlights the significance of the Naive Bayes algorithm and its utility
in sentiment analysis tasks, including its use in YouTube comment analysis [4].

The Naive Bayes classifier is a probabilistic model used extensively in sentiment analysis.
The article from Analytics Vidhya (2022) explains how Naive Bayes can be built from
scratch and applied to sentiment analysis. The Naive Bayes model operates on the
assumption that each feature (word or token) contributes independently to the final
classification. Although this assumption does not hold in all real-world cases, it has proven
to be a robust and efficient model for text classification, especially in domains with a high
volume of text data such as YouTube comments. The steps involved in creating a Naive
Bayes sentiment analysis classifier are: Preprocessing, cleaning the comments by removing
stop words and punctuation and converting the text to lowercase; Feature Extraction,
transforming the text into a feature vector using methods such as TF-IDF (Term
Frequency-Inverse Document Frequency); Model Building, training the Naive Bayes
classifier on labeled data (positive, negative, or neutral sentiments); and Prediction and
Evaluation, testing the model on unseen data and evaluating its performance using metrics
like accuracy, precision, recall, and F1-score.

The article from Analytics Vidhya mentions the importance of preprocessing to improve
model performance. Handling issues like spelling mistakes, lemmatization, and stemming
plays a crucial role in preparing the text for analysis. Despite this, achieving high accuracy
in predicting sentiment remains challenging, and the use of more complex models like deep
learning may be required to overcome these issues in future research. Several studies are
also exploring multimodal sentiment analysis, where comments are analyzed in
conjunction with video metadata, audio features, or visual cues to better understand overall
user sentiment. This area of research could significantly enhance the understanding of
sentiment in YouTube videos by providing a richer, more comprehensive analysis. Overall,
Naive Bayes provides a solid foundation for sentiment analysis, and its application to
YouTube comments remains an important area of research, contributing valuable insights
into public opinion on digital platforms [5].

Chapter 3 METHODOLOGY

3.1 System Block Diagram

Figure 3-1 System Flow Diagram

Figure 3-2 Sentiment analysis implementation

3.2 Hardware and software required
Software Requirement

Operating System: Windows 11

Programming Language: Python

Libraries/Frameworks: pandas, streamlit, matplotlib, nltk, pickle, numpy, scikit-learn

Dependency management tool: pip

Web services/APIs: YouTube API

IDE: VS Code, Google Colab

Version control: Git 2.45.2, GitHub

Hardware requirement

Processor: Intel Core i5 or equivalent

RAM: 8GB

Storage: 256GB SSD

3.3 Data Collection:


In data extraction, we start by retrieving relevant information from various sources,
especially YouTube channels, where we gather comments, reviews, and transcripts for
sentiment analysis. We extract this data using the YouTube API. This method helps
us navigate the YouTube site, identify comment sections, and extract text efficiently. We
store the data in a CSV file and then parse these files to extract the necessary information.
For this, we read the file line by line, splitting each line into columns based on the delimiter
(a comma) to organize the data into rows and columns; this involves writing scripts to read
and process the file contents. By employing these methods, we compile a dataset ready for
sentiment analysis, which helps us understand the public's sentiment towards a particular
video or topic. Moreover, we apply the same process to extract the transcript for defining
the class of a YouTube comment for better recommendation.
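
As a small illustration (not part of the original pipeline), the saved CSV can be loaded back with pandas; the file name and column names below follow the extraction script in the appendix:

import pandas as pd

# 'data.csv' is produced by the extraction script in the appendix (columns: Name, Comment)
df = pd.read_csv('data.csv')
print(df.shape)              # number of comments and columns
print(df['Comment'].head())  # preview the first few raw comments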

3.4 Pre-processing
In data preprocessing, we aim to transform raw data into a clean, structured format suitable
for analysis or model training. After data extraction, arranging the data into the required
format is necessary to use the algorithm effectively. For example, let the extracted data be
"I,’am a good boy!. It Doesn't mean everyone loves me." Here, the raw data is messy,
containing noise, inconsistencies, and irrelevant information that can negatively impact the
performance of analytical models.

Thus, the required input for Naive Bayes and Logistic Regression is cleaned and processed
text. First, we remove special characters and punctuation, leading to "I am a good boy It
Doesn't mean everyone loves me." Next, we convert all text to lowercase for consistency:
"i am a good boy it doesn't mean everyone loves me." We then remove stopwords like
"am" and "a": "i good boy it doesn't mean everyone loves me." Finally, we lemmatize the
words to their root forms: "i good boy it not mean everyone love me." This processed text
is now ready for further analysis using Naive Bayes, Logistic Regression, or other machine
learning algorithms. A brief description of the different steps involved is given below.

3.4.1 Data cleaning


In data cleaning, we remove HTML tags, punctuation, numbers, and other non-text
elements to ensure only the essential text remains. This step is crucial for eliminating noise
and inconsistencies that can negatively affect the analysis. For example, consider the
sentence: "I, ‘am a good boy! It Doesn’t mean everyone loves me." We begin by removing
special characters like punctuation, resulting in "I am a good boy It Doesn’t mean everyone
loves me." This clean version of the text is now free from distractions and ready for further
processing.

3.4.2 Convert to Lower case


Our first step for sentiment analysis is to convert the text into lowercase to ensure
consistency and case insensitivity in word representation. By transforming the text to
lowercase, we standardize the representation of words. This step is crucial because it treats
"It" and "it" as identical, eliminating case sensitivity during analysis. For instance, consider
the example: "I am a good boy, It Doesn't mean everyone loves me." Here, the output after
lowercase conversion is "i am a good boy, it doesn't mean everyone loves me".

3.4.3 Tokenization
In tokenization, we split the text or sentence into individual words or tokens. This step is
crucial because it converts a string of text into a list of words, allowing for more granular
analysis. By breaking down the text into smaller units, we can analyze the frequency of
words, their relationships, and patterns within the text. This process involves identifying
the boundaries between words, punctuation, and other elements, effectively segmenting the
text into its basic components. Tokenization enables us to focus on individual words, which
is essential for sentiment analysis, where the presence or absence of specific words can
significantly impact the results. For instance, by tokenizing the sentence "i good boy it not
mean everyone love me," we can isolate each word and analyze its role and importance in
the sentence, providing a foundation for further preprocessing steps like stopword removal
and lemmatization.

Tokenization process:

Original Text: "I am a good boy it doesn't mean everyone loves me"

Tokens: ["I", "am", "a", "good", "boy", "it", "doesn't", "mean", "everyone", "loves", "me"]

Stopwords

Stopwords are common words that typically do not carry significant meaning on their own,
such as "and," "the," "is," and "in." Thus, removing stopwords helps to focus on the words
that are more meaningful and relevant to the sentiment analysis, improving the efficiency
and performance of the model. In the case of the Naive Bayes classifier, however, the
removal of stopwords can be counterproductive, since some stopwords are context
dependent. For example, the probability of …

List of some stopwords

a, about, above, after, again, all, am, an, and, any, are, as, at, be, because, been, before,
being, below, between, both, but, by, can, could, did, do, does, doing, down, during, each,
few, for, from, further, had, has, have, he, her, here, hers, herself, him, himself, his, how,
I, if, in, into, is, it, its, itself, let's, me, more, most, my, myself, of, off, on, once, only, or,
other, our, ours, ourselves, out, over, own, same, should, so, some, such, than, that, the,
their, theirs, them, themselves, then, there, these, they, this, those, though, to, too, under,
until, up, very, was, we, were, what, when, where, which, while, who, whom, why, with,
would, you, your, yours, yourself, yourselves etc.

3.4.4 Lemmatization
Lemmatization is a text pre-processing technique that transforms words into their base or
root form, known as the "lemma." Unlike stemming, which often removes prefixes or
suffixes without considering context, lemmatization takes into account the word's usage
within a sentence. This contextual analysis ensures that words are reduced to their correct
base form based on their meaning in the given context. Additionally, lemmatization
involves part-of-speech awareness, where the grammatical category of a word—whether it
is a noun, verb, adjective, etc.—is considered to accurately determine its lemma.
Furthermore, lemmatization typically relies on dictionaries or lexicons to map words to
their standard base forms, providing consistency and accuracy in the process.

For example, in the sentence "The cats are running quickly," lemmatization would
transform "cats" to "cat," "are" to "be," and "running" to "run," reflecting their base forms
in the context of the sentence.
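
A minimal sketch of these preprocessing steps with NLTK is shown below; it assumes the required NLTK resources (punkt, stopwords, wordnet) have already been downloaded and reuses the example sentence from above:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "I am a good boy! It Doesn't mean everyone loves me."

# lowercase and tokenize
tokens = word_tokenize(text.lower())

# keep alphanumeric tokens and drop stopwords
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t.isalnum() and t not in stop_words]

# lemmatize the remaining tokens to their base forms
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])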

3.5 Feature Extraction


Term Frequency-Inverse Document Frequency (TF-IDF)

The TF-IDF score for a term $t$ in a document $d$ within a corpus $D$ is given by:

$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$

Where,

Term Frequency (TF) measures how frequently a term $t$ appears in document $d$. It is
calculated as:

$$\text{TF}(t, d) = \frac{\text{Count}(t, d)}{\text{Total terms in } d}$$

Here, $\text{Count}(t, d)$ is the number of times term $t$ appears in document $d$, and the
denominator is the total number of terms in the document.

Inverse Document Frequency (IDF) measures how important a term $t$ is across the
entire corpus $D$. It is calculated as:

$$\text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents in } D}{\text{Number of documents containing } t}\right)$$

Here, the denominator is the count of documents in which term $t$ appears at least once.
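
A brief, illustrative sketch of computing TF-IDF features with scikit-learn (the library used in this project) follows; the toy documents are placeholders. Note that scikit-learn uses a smoothed IDF and L2 normalization, so its values differ slightly from the textbook formula above.

from sklearn.feature_extraction.text import TfidfVectorizer

# toy documents standing in for preprocessed comments
docs = [
    "good video very helpful",
    "bad video clickbait disappointing",
    "good content very good",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # rows: documents, columns: terms

# TF-IDF weight of each term in the first document
for term, idx in sorted(vectorizer.vocabulary_.items()):
    print(term, round(tfidf_matrix[0, idx], 3))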

3.6 Model Implementation


How Does a Naive Bayes Classifier Work?

Let's see how a Naive Bayes classifier works. Bayes' theorem for conditional probabilities
is given below:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

$P(A|B)$ is the posterior probability, i.e., the probability of a hypothesis $A$ given that event
$B$ occurs. $P(B|A)$ is the likelihood, i.e., the probability of the evidence given that
hypothesis $A$ is true. $P(A)$ is the prior probability, i.e., the probability of the hypothesis
before observing the evidence, and $P(B)$ is the marginal probability, i.e., the probability of
the evidence. When Bayes' theorem is applied to classify text documents, the class variable
$c$ of a particular document $d$ is given by:

$$c_{MAP} = \arg\max_{c} P(c|d) \quad \text{(MAP is "maximum a posteriori", i.e., the most likely class)}$$

$$= \arg\max_{c} \frac{P(d|c)\,P(c)}{P(d)} \quad \text{(Bayes' rule)}$$

$$= \arg\max_{c} P(d|c)\,P(c) \quad \text{(dropping the denominator)}$$

$$= \arg\max_{c} P(x_1, x_2, x_3, \ldots, x_n|c)\,P(c) \quad \text{(document } d \text{ represented as features } x_1, \ldots, x_n)$$

Let the feature conditional probabilities $P(x_i|c)$ be independent of each other (the
conditional independence assumption). So,

$$P(x_1, x_2, \ldots, x_n|c) = P(x_1|c) \times P(x_2|c) \times \cdots \times P(x_n|c)$$

Now, if we consider words as the features of the document, the prior and the individual
feature conditional probabilities can be calculated using the following formulas:

$$\hat{P}(c_j) = \frac{\text{doccount}(C = c_j)}{N_{doc}}$$

$$\hat{P}(w_i|c_j) = \frac{\text{count}(w_i, c_j)}{\sum_{w \in V} \text{count}(w, c_j)}$$

i.e., the fraction of times word $w_i$ appears among all words in documents of topic $c_j$.
But what if a given word $w_i$ does not occur in any training document of class $c_j$ but
appears in a test document? $P(w_i|c_j)$ will become 0, which means the probability of the
test document belonging to class $c_j$ will become 0. To avoid this, Laplace smoothing is
introduced, and the conditional feature probabilities are calculated in the following way:

$$\hat{P}(w_i|c) = \frac{\text{count}(w_i, c) + 1}{\sum_{w \in V}\left(\text{count}(w, c) + 1\right)} = \frac{\text{count}(w_i, c) + 1}{\left(\sum_{w \in V} \text{count}(w, c)\right) + |V|}$$

where $|V|$ is the number of unique words in the text corpus. This way we can easily deal
with unseen test words.
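
The Laplace-smoothed probabilities above can be illustrated directly from word counts; the following is only a toy sketch of the formula, not the project's training code:

from collections import Counter

# toy training data: (tokenized comment, class label)
train = [
    (["good", "video", "love", "it"], "positive"),
    (["great", "content", "good"], "positive"),
    (["bad", "clickbait", "video"], "negative"),
]

# word counts per class and the shared vocabulary V
word_counts = {"positive": Counter(), "negative": Counter()}
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for counts in word_counts.values() for w in counts}

def p_word_given_class(word, label):
    # Laplace smoothing: (count(w, c) + 1) / (sum over V of count(w, c) + |V|)
    return (word_counts[label][word] + 1) / (sum(word_counts[label].values()) + len(vocab))

print(p_word_given_class("good", "positive"))    # seen word
print(p_word_given_class("boring", "positive"))  # unseen word still gets a non-zero probability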

Logistic Regression Basics

Logistic regression predicts the probability that a given input belongs to a certain class. It
uses a logistic (sigmoid) function to map any real-valued number into the range [0, 1]. The
logistic function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^T x + b$$

where

• $w$ is the weight vector.
• $x$ is the feature vector of the input data.
• $b$ is the bias term.

Model Output Interpretation

The output of the logistic function, $\sigma(z)$, represents the probability that the input
belongs to the positive class. If $\sigma(z) > 0.5$, we classify the input as positive; otherwise,
it is classified as negative.

Mathematical Framework

The logistic regression model estimates the probability of the positive class as:

$$P(y = 1|x) = \sigma(w^T x + b)$$

The decision boundary is where this probability is 0.5, i.e., $w^T x + b = 0$.

Training with Maximum Likelihood Estimation

The model parameters $w$ and $b$ are learned by maximizing the likelihood function:

$$L(w, b) = \prod_{i=1}^{N} P(y_i|x_i)^{y_i}\left(1 - P(y_i|x_i)\right)^{1 - y_i}$$

Taking the log of the likelihood (log-likelihood) to simplify:

$$\log L(w, b) = \sum_{i=1}^{N}\left[\,y_i \log\left(\sigma(w^T x_i + b)\right) + (1 - y_i)\log\left(1 - \sigma(w^T x_i + b)\right)\right]$$

We then use optimization techniques like gradient descent to find the optimal $w$ and $b$.
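
In this project the optimization itself is handled by scikit-learn rather than hand-written gradient descent; a minimal, illustrative sketch on toy data (with max_iter=1000, as mentioned in the implementation plan) might look like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy labeled comments: 1 = positive, 0 = negative
texts = ["loved this video", "very helpful content", "clickbait and disappointing", "waste of time"]
labels = [1, 1, 0, 0]

# TF-IDF features fed into logistic regression (a sigmoid over w^T x + b)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

print(clf.predict(["really helpful video"]))        # predicted class
print(clf.predict_proba(["really helpful video"]))  # [P(negative), P(positive)]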

3.7 Evaluating and Tuning the Model


The performance of the model is evaluated using the following metrics.

Metrics

• Accuracy: measures the overall correctness of the model's predictions, calculated as the
ratio of correctly predicted instances to the total instances.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

• Precision: measures the proportion of correctly predicted positive instances among the
instances predicted as positive.

$$\text{Precision} = \frac{TP}{TP + FP}$$

• Recall: measures the proportion of correctly predicted positive instances among all
actual positive instances.

$$\text{Recall} = \frac{TP}{TP + FN}$$

• F1-Score: the harmonic mean of precision and recall, providing a balance between the
two metrics.

$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Confusion Matrix

• True Positives (TP): Number of correctly predicted positive instances.

• False Positives (FP): Number of incorrectly predicted positive instances (actually
negative).

• True Negatives (TN): Number of correctly predicted negative instances.

• False Negatives (FN): Number of incorrectly predicted negative instances (actually
positive).

These metrics help in understanding the performance of the model and the classification of
sentiments.
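
A short sketch of computing these metrics with scikit-learn is given below; the label vectors are hypothetical placeholders for the actual test split:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# y_test: true labels, y_pred: model predictions (placeholder values)
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))

# rows are actual classes, columns are predicted classes -> [[TN, FP], [FN, TP]]
print(confusion_matrix(y_test, y_pred))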

3.8 Presentation of Output


The output of the project will be displayed as follows.

Positive vs. Negative: Output can be presented as a binary classification, where each text
is labeled as either positive or negative sentiment.

Confidence Scores: The calculated confidence score or probability for each classification
indicates the model's certainty in its prediction.

Visualizations: We will use charts or graphs to visualize the distribution of sentiment
across different texts or categories.
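
As a sketch of the visualization step (with hypothetical percentages; the Streamlit frontend in the appendix produces the same kind of chart):

import matplotlib.pyplot as plt

# hypothetical sentiment shares for one video
positive_pct, negative_pct = 68.0, 32.0

fig, ax = plt.subplots()
ax.bar(["Positive", "Negative"], [positive_pct, negative_pct], color=["#a6d189", "#e78284"])
ax.set_ylim(0, 100)
ax.set_ylabel("Percentage (%)")
ax.set_title("Overall Sentiment")
plt.show()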

3.9 Implementation Plan

1. Project Overview:

The project is designed to perform sentiment analysis on YouTube video comments to


provide insights into viewer opinions and enhance content strategies. The goal is to classify
comments as positive, negative, or neutral using machine learning models like Naive Bayes
and Logistic Regression.

2. Programming Language and Technology:

The project is implemented in Python, chosen for its robust libraries and frameworks for
natural language processing (NLP) and machine learning.

- Key libraries include:

- NLTK and Scikit-learn: For text preprocessing and model training.

- Pandas and NumPy: For data manipulation and numerical computations.

- Matplotlib: For data visualization.

3. Hardware and Network Requirements:

- Hardware:

A standard development environment with at least 4GB of RAM and 2 CPU cores is used
to efficiently handle data processing and model training.

- Network:

A stable internet connection is required to access the YouTube Data API for collecting
comments and to download necessary libraries.

4. System Design and Integration:

- Data Collection:

Video comments are collected using the YouTube Data API, supplemented with additional
datasets such as Kaggle's sentiment datasets and a self-classified list of about 2,500 sarcasm
words. Data collection involves making API requests and managing the response data
for further processing.

- Text Preprocessing:

Comments are preprocessed through tokenization, stop-word removal, and


stemming/lemmatization. These steps are essential for converting raw text into a structured
format suitable for model training.

- Model Training:

The Naive Bayes classifier and Logistic Regression models are used for sentiment
classification. Unigram features are used with the Multinomial Naive Bayes classifier to
train the models on the preprocessed data. Logistic Regression is trained using up to 1000
iterations to optimize performance.

- Integration:

- Trained models are integrated into a backend system that processes new comments in
real-time. This system classifies comments based on their sentiment, providing actionable
insights.

- Deployment:

- The system is deployed locally on a localhost environment for testing and demonstration
purposes. This setup allows for efficient development and testing without the need for
cloud platforms.

- Secure management of API keys and user data is ensured during deployment.

5. Testing and Validation:

- Testing:

- Unit tests are conducted for individual components, and integration tests are performed
to ensure that the entire system functions as expected.

- Validation:

- Cross-validation techniques are used to assess model accuracy. The dataset is split into
80% for training and 20% for testing. Logistic Regression is validated using 100-fold cross-
validation to ensure model robustness and reliability.
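
A minimal sketch of this split-and-validate setup on toy data is shown below; the fold count and the labels are illustrative only, while the actual project uses the full comment dataset:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score

# toy labeled comments: 1 = positive, 0 = negative
texts = ["loved it", "great video", "so helpful", "nice work", "very clear",
         "terrible clickbait", "boring video", "waste of time", "misleading", "too long"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80% training / 20% held-out test split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

model = make_pipeline(CountVectorizer(), MultinomialNB())

# k-fold cross-validation on the training portion (k is small here only because the data is tiny)
scores = cross_val_score(model, X_train, y_train, cv=4)
print("Accuracy per fold:", scores)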

6. Maintenance and Updates:

- The models are periodically updated with new data to maintain classification accuracy.
Continuous monitoring of system performance and user feedback is conducted to address
any issues and improve functionality.

In summary, this project implements a robust sentiment analysis system using Python,
leveraging the Naive Bayes and Logistic Regression classifiers. The system is deployed
locally and provides detailed insights into YouTube video comments, enabling content
creators to understand and respond to viewer feedback effectively.

3.10 Problems Faced

Sarcasm and Irony: Detecting sarcasm and irony remains a significant challenge. These
forms of expression often invert the literal meaning of words, making it difficult for models
to correctly infer the intended sentiment.

Slang and Informal Language: YouTube comments frequently include slang,


abbreviations, and informal language, which can complicate sentiment analysis. Models
trained on formal language may struggle with these unconventional forms of expression.

Mixed Sentiments: Comments may contain mixed sentiments, expressing both positive and
negative opinions. Accurately identifying and categorizing these sentiments requires
sophisticated models capable of understanding context and nuance.

Data Imbalance: The distribution of sentiments in YouTube comments can be skewed, with
a higher prevalence of neutral or positive comments. This imbalance can affect model
performance and lead to biased predictions.

Chapter 4 RESULT AND ANALYSIS

4.1 Results
The bar chart displays the average term frequency (TF) of the top 20 words in a document.

The next bar chart displays the average term frequency-inverse document frequency
(TF-IDF) of the top 20 words in a document.

From the output for the given document, we can conclude that the document bears positive
sentiment.
4.2 Performance Metrics

The figure above shows the performance metrics and confusion matrix for a binary
classification model. The total dataset size is 1.6 million comments, and the model's
accuracy lies between 75% and 91%, indicating relatively good performance.

Classification Report

1. Precision

- Negative: 0.75

- Positive: 0.78

2. Recall

- Negative: 0.79

- Positive: 0.74

3. F1-Score

- Negative: 0.77

- Positive: 0.76

These values indicate the balance between precision and recall, where the model is slightly
better at predicting the "negative" class but has comparable performance for both.

Confusion Matrix

- True negatives (correctly classified as negative): 650,916

- False positives (incorrectly classified as positive): 214,250

- False negatives (incorrectly classified as negative): 171,958

- True positives (correctly classified as positive): 562,876

Chapter 5 CONCLUSION

In this sentiment analysis project, we utilized a Naive Bayes classifier to evaluate and
predict sentiments within our dataset. Our model achieved an accuracy of above 75%,
demonstrating its effectiveness in accurately classifying sentiments. A comparative
analysis between our manual implementation and scikit-learn's built-in model revealed that
scikit-learn’s pre-built functionalities provided superior performance, due to its optimized
algorithms and efficient preprocessing capabilities.

We encountered some challenges related to the use of unigrams for feature extraction.
Specifically, the unigram model treated common stop words like "not" as irrelevant, which
led to a potential loss of context when such words were paired with positive or negative
terms. This issue highlighted the limitation of unigrams in capturing the nuanced sentiment
conveyed by multi-word expressions. As a result, incorporating bigram and trigram models
could potentially offer better accuracy by better capturing the context and relationships
between words.
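
One possible way to explore this, using scikit-learn's CountVectorizer (our suggestion, not something evaluated in this report), is to widen the n-gram range so that phrases such as "not good" survive as single features:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good at all", "really good video"]

# unigrams only: "not" and "good" are separate features, so negation context is lost
unigrams = CountVectorizer(ngram_range=(1, 1))
print(unigrams.fit(docs).get_feature_names_out())

# unigrams + bigrams + trigrams: "not good" becomes its own feature
ngrams = CountVectorizer(ngram_range=(1, 3))
print(ngrams.fit(docs).get_feature_names_out())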

For model persistence, we used pickle files (.pkl) to save and load our trained models.
Pickle is a Python module that serializes and deserializes Python objects, allowing us to
save the state of our model and reuse it without retraining. This approach facilitated
efficient model management and deployment.
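
A short sketch of this save/load cycle is given below; the small stand-in pipeline and file name are illustrative, while the real project pickles its trained model and vectorizer as shown in the appendix:

import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# fit a small stand-in model
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(["good video", "bad video"], [1, 0])

# serialize the trained model to disk
with open('comment_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# deserialize later and reuse it without retraining
with open('comment_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(["nice video"]))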

Overall, the project was a success. We utilized datasets from Kaggle and manual collection
to ensure a comprehensive evaluation. The combination of theoretical exploration and
practical application has deepened our understanding of sentiment analysis and
demonstrated the value of leveraging advanced tools and techniques for robust results.

REFERENCES

[1] G. B. Aliman et al., "Sentiment Analysis using Logistic Regression," Journal of
Computational Innovations and Engineering Applications, pp. 35-40, July 2022.

[2] S. Vajjala, B. Majumder, A. Gupta, and H. Surana, Practical Natural Language
Processing, O'Reilly Media, 2020.

[3] R. Singh et al., "YouTube Comments Sentimental Analysis," International Journal of
Scientific Research in Engineering and Management (IJSREM), vol. 5, May 2021.

[4] S. B. S. PP, "Sentimental analysis using Naive Bayes classifier," Mar. 30, 2019.

[5] M. S. Garcia et al., "Sentiment Analysis and its Correlation with Engagement Metrics
on YouTube Cooking Recipe Videos," Content Engagement, vol. 20, pp. 512-528, 2023.

APPENDIX
Text pre-processing:

Categorize

import pandas as pd
from textblob import TextBlob

df = pd.read_csv('word_count.csv')
print(df.columns)

def get_sentiment(word):
    analysis = TextBlob(str(word))
    if analysis.sentiment.polarity > 0:
        return 'positive'
    elif analysis.sentiment.polarity < 0:
        return 'negative'
    else:
        return 'neutral'

df['Sentiment'] = df['word'].apply(get_sentiment)
df.to_csv('final.csv', index=False)
print("Sentiment analysis completed...")

Youtube comment extraction

import os
import pandas as pd
from googleapiclient.discovery import build
from urllib.parse import urlparse, parse_qs

API_KEY = os.getenv("YOUTUBE_API")

# this function is used to get the video comments
def get_comments(video_id):
    youtube = build('youtube', 'v3', developerKey=API_KEY)
    comments = []
    request = youtube.commentThreads().list(
        part="snippet",
        videoId=video_id,
        maxResults=100
    )
    response = request.execute()
    # gets the comments with author name
    while request is not None:
        for item in response['items']:
            comment = item['snippet']['topLevelComment']['snippet']
            author = comment['authorDisplayName']
            text = comment['textDisplay']
            comments.append([author, text])
        if 'nextPageToken' in response:
            request = youtube.commentThreads().list(
                part="snippet",
                videoId=video_id,
                pageToken=response['nextPageToken'],
                maxResults=100
            )
            response = request.execute()
        else:
            break
    return comments

# saving data to csv file format
def save_to_csv(comments, filename):
    df = pd.DataFrame(comments, columns=['Name', 'Comment'])
    df.to_csv(filename, index=False)

# extract video id from the video url
def get_video_id(url):
    parsed_url = urlparse(url)
    # the ID is usually in the 'v' parameter
    if parsed_url.netloc in ["www.youtube.com", "youtube.com"]:
        query_params = parse_qs(parsed_url.query)
        video_id = query_params.get("v")
        if video_id:
            return video_id[0]
    if parsed_url.netloc == "youtu.be":
        return parsed_url.path[1:]
    return None

video_url = input("Enter video url: ")
video_id = get_video_id(video_url)
comments = get_comments(video_id)
save_to_csv(comments, 'data.csv')
print(f"Extracted {len(comments)} comments and saved to data.csv")

Data Cleaning

import pandas as pd
import re
import nltk
from nltk.corpus import wordnet, stopwords
from nltk.stem import WordNetLemmatizer

# Download the WordNet data for lemmatization
# (WordNet is a large database of English words)
nltk.download('wordnet')

# get stopwords dataset
nltk.download('stopwords')

# the POS tagger identifies the structure of a sentence (nouns, verbs, etc.)
nltk.download('averaged_perceptron_tagger_eng')

stop_words = set(stopwords.words('english'))

def clean_text(text):
    # removes url from text
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    # remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # remove non ascii characters (emojis, any other words)
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    return text.lower()

# lemmatization: changing, changed, changes -> change
def get_wordnet_pos(treebank_tag):
    # J, V, N, R are treebank tags
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def remove_stop_words(tokens):
    return [word for word in tokens if word not in stop_words]

def lemmatize_word(word):
    # WordNetLemmatizer uses the WordNet lexical database to look up the root word
    lemmatizer = WordNetLemmatizer()
    # determine the part of speech to lemmatize correctly
    pos = get_wordnet_pos(nltk.pos_tag([word])[0][1])
    return lemmatizer.lemmatize(word, pos)

df = pd.read_csv('data.csv')
df['Cleaned_Comment'] = df['Comment'].apply(lambda x: clean_text(x))
df['Tokens'] = df['Cleaned_Comment'].apply(lambda x: remove_stop_words(x.split()))

tokens = [word for sublist in df['Tokens'].tolist() for word in sublist]
words = [lemmatize_word(word) for word in tokens]

word_count = pd.Series(words).value_counts().reset_index()
word_count.columns = ['word', 'count']
word_count.to_csv('word_count.csv', index=False)

Using Bayes
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv('final.csv')

# if the count of "good" is 3, the text becomes "good good good"
df['text'] = df.apply(lambda row: f"{row['word']} " * row['count'], axis=1)

text_data = df['text']
sentiments = df['Sentiment']
sentiments = sentiments.map({'positive': 1, 'neutral': 0, 'negative': -1})

# split dataset for testing and training
# 20% used for testing, 80% for training
X_train, X_test, y_train, y_test = train_test_split(
    text_data, sentiments, test_size=0.2, random_state=42, stratify=sentiments)

# Create a pipeline with CountVectorizer and Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model with zero_division parameter
print(classification_report(y_test, y_pred, zero_division=0))

# Predict sentiment for the overall video, e.g.
# "i love movie but it is bad" -> [-, 1, 0, -, -, -, -1]
def aggregate_sentiments(df, model):
    predictions = []
    for index, row in df.iterrows():
        text = f"{row['word']} " * row['count']
        sentiment = model.predict([text])[0]
        predictions.extend([sentiment] * row['count'])
    # mapping of 1 (positive) -> count and -1 (negative) -> count;
    # .get(label, 0) returns 0 when a label was never predicted
    sentiment_counts = pd.Series(predictions).value_counts()
    if sentiment_counts.get(1, 0) > sentiment_counts.get(-1, 0):
        return 'positive'
    elif sentiment_counts.get(-1, 0) > sentiment_counts.get(1, 0):
        return 'negative'
    else:
        return 'neutral'

overall_sentiment = aggregate_sentiments(df, model)
print(f'Overall Sentiment of the Video: {overall_sentiment}')
Frontend: Loading Model and Seeing the result

from googleapiclient.discovery import build
import streamlit as st
import pickle
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt
from PIL import Image

def get_video_id(url):
    from urllib.parse import urlparse, parse_qs
    query = urlparse(url)
    if query.hostname == 'youtu.be':
        return query.path[1:]
    if query.hostname in {'www.youtube.com', 'youtube.com'}:
        if query.path == '/watch':
            return parse_qs(query.query)['v'][0]
        if query.path[:7] == '/embed/':
            return query.path.split('/')[2]
        if query.path[:3] == '/v/':
            return query.path.split('/')[2]
    return None

def fetch_youtube_comments(video_id, api_key, max_results=1000):
    youtube = build('youtube', 'v3', developerKey=api_key)
    comments = []
    response = youtube.commentThreads().list(
        part='snippet',
        videoId=video_id,
        maxResults=min(max_results, 100),
        textFormat='plainText'
    ).execute()
    while response:
        for item in response['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comments.append(comment)
        # Check for next page token
        if 'nextPageToken' in response and len(comments) < max_results:
            response = youtube.commentThreads().list(
                part='snippet',
                videoId=video_id,
                pageToken=response['nextPageToken'],
                maxResults=min(max_results - len(comments), 100)
            ).execute()
        else:
            break
    return comments

def load_model():
    with open('../pkl/comment_model.pkl', 'rb') as f:
        model = pickle.load(f)
    with open('../pkl/vectorizer.pkl', 'rb') as f:
        vectorizer = pickle.load(f)
    return model, vectorizer

cmnt_model, td_vectorize = load_model()

def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())
    filtered_tokens = [lemmatizer.lemmatize(word) for word in tokens
                       if word.isalnum() and word not in stop_words]
    return ' '.join(filtered_tokens)

API_KEY = 'AIzaSyC81K7gktMIJx9psPQLcQ74ZNM1TbPSaZ0'

st.title('YouTube Comment Sentiment Analysis')
st.write('Enter a YouTube video URL and analyze the comments for sentiment:')

youtube_url = st.text_input('Enter YouTube video URL:')
num_comments = st.slider('Number of comments to fetch:', min_value=10, max_value=1000, value=100)

if st.button('Analyze Comments'):
    if youtube_url:
        video_id = get_video_id(youtube_url)
        if video_id:
            comments = fetch_youtube_comments(video_id, API_KEY, num_comments)
            if comments:
                st.write(f"Fetched {len(comments)} comments. Analyzing sentiment...")

                # Preprocess and analyze comments
                processed_comments = [preprocess_text(comment) for comment in comments]
                transformed_comments = td_vectorize.transform(processed_comments)
                predictions_proba = cmnt_model.predict_proba(transformed_comments)

                # Calculate positive and negative sentiments
                positive_count = sum(pred[1] > pred[0] for pred in predictions_proba)
                negative_count = len(predictions_proba) - positive_count
                positive_percentage = (positive_count / len(predictions_proba)) * 100
                negative_percentage = 100 - positive_percentage

                st.write(f"Positive Sentiment: {positive_percentage:.2f}%")
                st.write(f"Negative Sentiment: {negative_percentage:.2f}%")

                if positive_percentage > negative_percentage:
                    sentiment = "Positive"
                elif negative_percentage > positive_percentage:
                    sentiment = "Negative"
                else:
                    sentiment = "Neutral"

                if sentiment == "Positive":
                    image_path = 'assets/1.png'
                elif sentiment == "Negative":
                    image_path = 'assets/2.png'
                else:
                    image_path = 'assets/3.png'

                image = Image.open(image_path)
                col1, col2, col3 = st.columns(3)
                col2.image(image, width=250)

                fig, ax = plt.subplots()
                fig.patch.set_facecolor('#f0f0f0')
                ax.bar(['Positive', 'Negative'],
                       [positive_percentage, negative_percentage],
                       color=['#a6d189', '#e78284'])
                ax.set_ylim(0, 100)
                ax.set_ylabel('Percentage (%)')
                ax.set_title('Overall Sentiment')
                st.pyplot(fig)

                fig, ax = plt.subplots()
                fig.patch.set_facecolor('#f0f0f0')
                ax.pie([positive_percentage, negative_percentage],
                       labels=['Positive', 'Negative'], autopct='%1.1f%%',
                       colors=['#a6d189', '#e78284'])
                ax.set_title('Sentiment Distribution')
                st.pyplot(fig)
            else:
                st.write("No comments found for this video.")
        else:
            st.write("Invalid YouTube URL.")
    else:
        st.write("Please enter a valid YouTube video URL.")
