A Project Report
on
Sentimental Analysis of YouTube Video Comments
Submitted By:
Submitted To:
Sanepa-2 Lalitpur
23 September, 2024
DECLARATION
We hereby declare that the report of the project entitled “Sentimental Analysis of
YouTube Video Comments” which is being submitted to the Department of Computer
and Information Technology Engineering, Everest Engineering College, Sanepa-2
Lalitpur, in partial fulfillment of the requirements for the award of the Degree of
Bachelor of Engineering in Computer Engineering, is a bona fide report of the work carried
out by us. We are responsible for the work submitted in this project, that the original work
is our own except as specified in the references and acknowledgements, and that the
original work contained herein has not been undertaken or done by unspecified sources
or persons.
……………………………..
External Examiner
Name: ……………………………….
Designation: …………………………
Affiliation: ………………………….
……………………………………….
PROJECT COMMITTEE
Department of Computer and IT Engineering
Everest Engineering College, Sanepa
COPYRIGHT
The author has agreed that the library, Everest Engineering College (EEC), Sanepa-2,
Lalitpur may make this report freely available for inspection. Moreover, the author has
agreed that permission for extensive copying of this project report for scholarly purposes may be granted by the lecturers who supervised the project work recorded herein or, in their absence, by the Head of the Department in which the project report was done. It is understood that due recognition will be given to the author of the report and to the Department of Computer and Information Technology Engineering, EEC, in any use of the material of this project report. Copying, publication, or other use of this report for financial gain without
approval of the Department and author’s written permission is prohibited. Request for
permission to copy or to make any other use of the material in this report in whole or in
part should be addressed to the Head of Department, Department of Computer and
Information Technology Engineering.
ACKNOWLEDGEMENT
The success of this project required a great deal of guidance and assistance from many people, and we were extremely fortunate to receive it throughout the proposal defense of our third-year project work. Whatever we have accomplished is due to such guidance and assistance. Firstly, we would like to thank Pokhara University for including the third-year project as part of our curriculum. We would also like to express our sincere gratitude to our project committee members for their guidance and support throughout the development of this project; their expertise and insights were invaluable in selecting and creating the system.
ABSTRACT
This project focuses on the sentiment analysis of YouTube video comments to enhance
video recommendations and improve user experience. YouTube's vast and diverse content
presents a challenge in maintaining quality and relevance. To address this, sentiment
analysis is employed to classify comments as positive, negative, or neutral, providing
valuable feedback for categorizing videos and refining the recommendation system. The
analysis utilizes both Naive Bayes and Logistic Regression classifiers. Specifically, the
Multinomial Naive Bayes Classifier, known for its effectiveness in text classification, was
implemented alongside Unigram features to capture the context and sentiment of
comments. This approach involved preprocessing text data through tokenization, stop-
word removal, and stemming to prepare it for classification. The Naive Bayes classifier,
trained with Unigram features, was used to predict the sentiment of comments, while
Logistic Regression was applied to further validate the results. By accurately identifying
viewer sentiment, the system improves the recommendation engine by promoting videos
with positive feedback and filtering out those with negative responses. This benefits both
content creators, who gain insights into audience perception, and viewers, who receive
more relevant and engaging content. Ultimately, the project enhances user satisfaction by
ensuring a more personalized and trustworthy platform experience.
Table of Contents
ACKNOWLEDGEMENT ................................................................................................... i
ABSTRACT ...................................................................................................................... ii
List of Figures .................................................................................................................... iv
List of Abbreviation ............................................................................................................ v
Chapter 1 INTRODUCTION....................................................................................... 1
1.1 Introduction .......................................................................................................... 1
1.2 Problem Statement ............................................................................................... 2
1.3 Objectives ............................................................................................................. 2
1.4 Scope and Applications ........................................................................................ 3
Chapter 2 LITERATURE REVIEW .......................................................................... 4
Chapter 3 METHODOLOGY ..................................................................................... 7
3.1 System Block Diagram......................................................................................... 7
3.2 Hardware and software required .......................................................................... 9
3.3 Data Collection ..................................................................................................... 9
3.4 Pre- processing ................................................................................................... 10
3.5 Feature Extraction .............................................................................................. 12
3.6 Model Implementation ....................................................................................... 13
3.7 Evaluating and Tune Model ............................................................................... 15
3.8 Presentation of Output ........................................................................................ 16
3.9 Implementation Plan .......................................................................................... 17
3.10 Problems Faced .................................................................................................. 19
Chapter 4 RESULT AND ANALYSIS ...................................................................... 21
4.1 Results ................................................................................................................ 21
4.2 Performance Metrics .......................................................................................... 23
Chapter 5 CONCLUSION ......................................................................................... 25
REFERENCES ................................................................................................................ 26
APPENDIX ..................................................................................................................... 27
List of Figures
List of Abbreviation
Chapter 1 INTRODUCTION
1.1 Introduction
YouTube is one of the most popular platforms for sharing and viewing videos, with over 2
billion active users every month. It offers content ranging from tutorials to entertainment,
and its open platform allows anyone to upload videos and share their ideas with the world.
However, maintaining the quality and reliability of content is a challenge due to the vast
amount of content being uploaded.
Sentiment analysis has evolved over the years, starting in the early 2000s with research
into understanding emotional tones in product reviews and online discussions. As machine
learning and NLP techniques developed, sentiment analysis became more sophisticated
and is now used across various platforms, including social media, product feedback, and
video comments.
Sentiment analysis scans comments to identify negative reactions. When a video receives
a large number of negative comments, it may indicate that the video is misleading or of
low quality. Based on this, YouTube can adjust its recommendation algorithm to stop
promoting such videos. For example, if comments frequently mention terms like
"clickbait" or "disappointing," sentiment analysis can flag the video as untrustworthy.
Sentiment analysis can detect patterns in spam or irrelevant comments by analyzing
language use and tone. Spam comments that repeat generic phrases or promote unrelated
content can be identified and filtered out. For instance, comments like "Great video!"
posted repetitively across many unrelated videos could be marked as spam. By
understanding the emotional tone of comments (whether they are positive, negative, or
neutral), sentiment analysis can improve YouTube’s recommendation system. Videos with
mostly positive comments can be promoted, while those with mixed or negative feedback
can be demoted, helping users discover more relevant content.
For example, videos with comments such as "Loved this video!" or "Very helpful" indicate
high engagement and quality, making them more suitable for recommendation.
Thus, this project accurately captures the sentiments of viewers regarding YouTube video
content, providing valuable insights for both content creators and new viewers. It helps
creators understand what type of content resonates well with the audience, while also
guiding viewers towards higher-quality videos. This ensures clarity on the kinds of content
that should be created and consumed, ultimately benefiting the entire YouTube community.
1.2 Problem Statement

The problem is to develop an effective system that can accurately analyze and categorize
the sentiments expressed in YouTube video comments as positive, negative, or neutral.
This system should be able to process large volumes of data efficiently, providing real-
time insights to help creators tailor their content strategies, improve audience engagement,
and assist moderators in identifying and addressing potential issues within the community.
Ultimately, this project aims to enhance the overall quality of content and interactions on
the YouTube platform through data-driven sentiment analysis.
1.3 Objectives
To analyze the sentiment of YouTube video comments and differentiate them into positive and negative categories.
1.4 Scope and Applications
This project focuses on analyzing viewer sentiments in YouTube video comments. By
categorizing comments as positive, negative, or neutral using sentiment analysis
techniques, it provides insights into how videos are perceived by the audience. The system
efficiently processes large volumes of comments, helping content creators understand
viewer feedback and improve their content strategies. Additionally, it can identify videos
with predominantly negative sentiments, which may indicate low-quality or misleading
content, allowing for better moderation and content management.
The application of this project spans several areas on the YouTube platform. For content
creators, sentiment analysis provides a clearer picture of how their content is received,
allowing them to adjust their approach based on feedback. For viewers, it enhances the
recommendation engine by promoting videos with positive feedback and reducing
exposure to negatively reviewed content. This project also benefits advertisers, as it enables
better ad targeting by analyzing viewer sentiments to match ads with appropriate content.
Moreover, it can be integrated into YouTube’s moderation system to detect and filter out
spam, misinformation, and irrelevant comments, ensuring a cleaner and more trustworthy
interaction environment.
Chapter 2 LITERATURE REVIEW
Text classification is a widely used task in NLP, with applications that include categorizing customer feedback and filtering spam comments. To develop sentiment analysis, a large dataset labeled with positive and negative sentiments is used, and the goal is to separate comments according to these labels. Dataset objects offer extensive functionality for manipulating data, but it can be useful to convert them to a pandas DataFrame for easier data visualization.
YouTube comments are a rich source of feedback and opinions on videos, products, and
services. Analyzing these comments allows for insights into user sentiments, trends, and
reactions to content. This information is essential for improving content strategies,
moderating harmful interactions, and enhancing user engagement. Given the sheer volume
of comments, manual analysis is impractical, making automated sentiment analysis a
necessity for deriving actionable insights. Sentiment analysis encompasses a range of
methodologies, which can be broadly categorized into traditional rule-based systems and
modern machine learning (ML) and deep learning approaches. Rule-Based Systems:
Traditional sentiment analysis methods rely on predefined rules and lexicons to evaluate
sentiment. These systems use lists of words associated with positive or negative sentiments
and apply grammatical rules to infer the overall sentiment of a text. While straightforward,
rule-based systems often struggle with nuances such as context and sarcasm.
Machine learning approaches, in contrast, learn sentiment from labeled examples. Naive Bayes is a probabilistic classifier based on Bayes' theorem, which
assumes feature independence. Despite its simplicity, Naive Bayes performs well in text
classification tasks [2].
Previous work by Siersdorfer et al. on YouTube analyzed over 6 million comments from
67,000 videos. They built prediction models to forecast the rating of new comments based
on previous user interactions. Pang, Lee, and Vaithyanathan performed sentiment analysis
on movie reviews, using ML algorithms such as Naïve Bayes and Support Vector Machines
(SVM). They showed that machine learning outperforms manual classification, but
sentiment analysis remains a complex problem due to conflicting positive and negative
expressions in the text. Smita Shree and Josh Brolin adopted a lexicon-based approach to
identify sentiment polarity in YouTube comments. However, their study revealed a lower
recall rate for negative sentiments due to linguistic variations in expressing dissatisfaction.
Other works, such as those by A. Kowcika et al., demonstrated how sentiment analysis on
Twitter could reveal a correlation between individual moods and real-world events. The
authors utilized features such as lemmatization, tokenization, n-grams, and stop word
removal to enhance the model's performance. By applying these techniques, the system
improved the accuracy of sentiment classification [3].
Sentiment analysis has been traditionally applied to various social media platforms such as
Twitter, Facebook, and blogs. YouTube, being one of the largest video-sharing platforms
globally, offers a massive dataset of comments reflecting users’ opinions on diverse topics.
The sentiment analysis of YouTube video comments presents unique challenges due to the
informal, unstructured, and often noisy nature of the data. Several research efforts have
been made to explore sentiment in YouTube comments to understand viewer reactions,
brand perception, and user engagement. Studies have demonstrated the use of machine
learning and natural language processing (NLP) techniques to classify and extract
sentiments from these comments. One of the primary algorithms used is the Naive Bayes
classifier due to its simplicity and effectiveness in text classification tasks. The article by
Analytics Vidhya highlights the significance of the Naive Bayes algorithm and its utility
in sentiment analysis tasks, including its use in YouTube comment analysis [4].
The Naive Bayes classifier is a probabilistic model used extensively in sentiment analysis.
The article from Analytics Vidhya (2022) explains how Naive Bayes can be built from
scratch and applied to sentiment analysis. The Naive Bayes model operates on the
assumption that each feature (word or token) contributes independently to the final
classification. Although this assumption does not hold in all real-world cases, it has proven
to be a robust and efficient model for text classification, especially in domains with a high
volume of text data such as YouTube comments. The steps involved in creating a Naive Bayes sentiment analysis classifier include: Preprocessing, cleaning the comments by removing stop words and punctuation and converting the text to lowercase; Feature Extraction, transforming the text into a feature vector using methods such as TF-IDF (Term Frequency-Inverse Document Frequency); Model Building, training the Naive Bayes classifier on labeled data (positive, negative, or neutral sentiments); and Prediction and Evaluation, testing the model on unseen data and evaluating its performance using metrics like accuracy, precision, recall, and F1-score.
The article from Analytics Vidhya mentions the importance of preprocessing to improve
model performance. Handling issues like spelling mistakes, lemmatization, and stemming
plays a crucial role in preparing the text for analysis. Despite this, achieving high accuracy
in predicting sentiment remains challenging, and the use of more complex models, such as deep learning, may be required to overcome these issues in future research. Several studies are
also exploring multimodal sentiment analysis, where comments are analyzed in
conjunction with video metadata, audio features, or visual cues to better understand overall
user sentiment. This area of research could significantly enhance the understanding of
sentiment in YouTube videos by providing a richer, more comprehensive analysis. Overall,
Naive Bayes provides a solid foundation for sentiment analysis, and its application to
YouTube comments remains an important area of research, contributing valuable insights
into public opinion on digital platforms [5].
Chapter 3 METHODOLOGY
3.1 System Block Diagram
Figure 3-2 Sentiment analysis implementation
3.2 Hardware and software required
Software Requirements
Libraries/Frameworks: pandas, streamlit, matplotlib, nltk, pickle, numpy, scikit-learn
Hardware Requirements
RAM: 8 GB
3.4 Pre-processing
In data preprocessing, we aim to transform raw data into a clean, structured, and suitable
format for analysis or model training. After data extraction, arranging the data into the
required format is necessary to use the algorithm effectively. For example, let the extracted
data be "I,’am a good boy!. It Doesn't mean everyone loves me." Here, the raw data is
messy, containing noise, inconsistencies, and irrelevant information that can negatively
impact the performance of analytical models.
Thus, the required input for Naive Bayes and KNN is in the form of cleaned and processed
text. First, we remove special characters and punctuation, leading to "I am a good boy, It
Doesn't mean everyone loves me." Next, we convert all text to lowercase for consistency:
"I am a good boy it doesn't mean everyone loves me." We then remove stopwords like
"am" and "a": "I good boy it doesn’t mean everyone loves me." Finally, we lemmatize the
words to their root forms: "I good boy it not mean everyone love me." This processed text
is now ready for further analysis using Naive Bayes, KNN, or other machine learning algorithms. A brief description of the different steps involved is given below.
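The following is a minimal sketch of this walkthrough, assuming NLTK's English stopword list and WordNet lemmatizer; because those resources differ slightly from the hand-worked example (NLTK's list also contains words such as "i" and "it"), the printed output will not match it word for word.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

raw = "I,'am a good boy!. It Doesn't mean everyone loves me."

# Step 1: remove special characters and punctuation.
text = re.sub(r"[^\w\s]", " ", raw)
# Step 2: convert to lowercase for consistency.
text = text.lower()
# Step 3: remove stopwords such as "am" and "a".
stop_words = set(stopwords.words('english'))
tokens = [t for t in text.split() if t not in stop_words]
# Step 4: lemmatize the remaining words to their root forms.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t, pos='v') for t in tokens])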
3.4.3 Tokenization
In tokenization, we split the text or sentence into individual words or tokens. This step is
crucial because it converts a string of text into a list of words, allowing for more granular
analysis. By breaking down the text into smaller units, we can analyze the frequency of
words, their relationships, and patterns within the text. This process involves identifying
the boundaries between words, punctuation, and other elements, effectively segmenting the
text into its basic components. Tokenization enables us to focus on individual words, which
is essential for sentiment analysis, where the presence or absence of specific words can
significantly impact the results. For instance, by tokenizing the sentence "I good boy it not mean everyone love me," we can isolate each word and analyze its role and importance in
the sentence, providing a foundation for further preprocessing steps like stopword removal
and lemmatization.
Tokenization process:
Tokens: ["I", "am", "a", "good", "boy", "it", "doesn't", "mean", "everyone", "love", "me"]
Stopwords

Stopwords are common words that typically do not carry significant meaning on their own, such as "and," "the," "is," and "in." Removing stopwords helps to focus on the words that are more meaningful and relevant to the sentiment analysis, improving the efficiency and performance of the model. In the case of the Naive Bayes classifier, however, removing every stopword can be counterproductive, because some stopwords are context dependent: a negation such as "not" reverses the sentiment of the words around it, so discarding it distorts the conditional probabilities the classifier learns (a small sketch of this appears after the list below). A typical English stopword list includes:
a, about, above, after, again, all, am, an, and, any, are, as, at, be, because, been, before,
being, below, between, both, but, by, can, could, did, do, does, doing, down, during, each,
few, for, from, further, had, has, have, he, her, here, hers, herself, him, himself, his, how,
I, if, in, into, is, it, its, itself, let's, me, more, most, my, myself, of, off, on, once, only, or,
other, our, ours, ourselves, out, over, own, same, should, so, some, such, than, that, the,
their, theirs, them, themselves, then, there, these, they, this, those, though, to, too, under,
until, up, very, was, we, were, what, when, where, which, while, who, whom, why, with,
would, you, your, yours, yourself, yourselves etc.
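As referenced above, here is a small sketch of stopword removal with NLTK in which negation words are deliberately kept; retaining negations is an illustrative choice motivated by the Naive Bayes concern, not necessarily what the project did.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Keep negations out of the stopword set so they can still carry sentiment.
negations = {"not", "no", "nor", "don't", "doesn't", "isn't", "wasn't"}
stop_words = set(stopwords.words('english')) - negations

tokens = ["this", "video", "is", "not", "good"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)   # ['video', 'not', 'good']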
3.4.4 Lemmatization
Lemmatization is a text pre-processing technique that transforms words into their base or
root form, known as the "lemma." Unlike stemming, which often removes prefixes or
suffixes without considering context, lemmatization takes into account the word's usage
within a sentence. This contextual analysis ensures that words are reduced to their correct
base form based on their meaning in the given context. Additionally, lemmatization
involves part-of-speech awareness, where the grammatical category of a word—whether it
is a noun, verb, adjective, etc.—is considered to accurately determine its lemma.
Furthermore, lemmatization typically relies on dictionaries or lexicons to map words to
their standard base forms, providing consistency and accuracy in the process.
For example, in the sentence "The cats are running quickly," lemmatization would
transform "cats" to "cat," "are" to "be," and "running" to "run," reflecting their base forms
in the context of the sentence.
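A short sketch of POS-aware lemmatization with NLTK's WordNetLemmatizer applied to the example sentence; the tag-mapping helper mirrors the one used in the appendix code.

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')

def to_wordnet_pos(tag):
    # Map Penn Treebank tags to the WordNet POS constants the lemmatizer expects.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("The cats are running quickly")
lemmas = [lemmatizer.lemmatize(word, to_wordnet_pos(tag))
          for word, tag in nltk.pos_tag(tokens)]
print(lemmas)   # e.g. ['The', 'cat', 'be', 'run', 'quickly']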
3.5 Feature Extraction

The TF-IDF score for a term t in a document d within a corpus D is given by:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

where

TF(t, d) = Count(t, d) / (Total terms in d)
Here, Count(t,d) is the number of times term t appears in document d, and the
denominator is the total number of terms in the document.
Inverse Document Frequency (IDF) measures how important a term t is across the entire corpus D. It is calculated as:

IDF(t, D) = log(Total number of documents in D / Number of documents in D containing t)
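As an illustration of feature extraction, scikit-learn's TfidfVectorizer computes these weights over a small corpus; note that its default IDF variant is smoothed, so the values differ slightly from the plain formula above, and the corpus here is illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "loved this video very helpful",
    "this video is clickbait and disappointing",
    "great video loved it",
]

# Fit the vectorizer: term frequencies are combined with inverse document frequencies.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Inspect the TF-IDF weights of the first comment.
for term, idx in vectorizer.vocabulary_.items():
    weight = tfidf[0, idx]
    if weight > 0:
        print(f"{term}: {weight:.3f}")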
3.6 Model Implementation

The Naive Bayes classifier is built on Bayes' theorem:

P(A|B) = P(B|A) · P(A) / P(B)
P(A|B) is the posterior probability, i.e., the probability of a hypothesis A, given that event
B occurs. P(B|A) is likelihood probability, i.e., the probability of the evidence given that
hypothesis A is true. P(A) is prior probability, i.e., the probability of the hypothesis before
observing the evidence, and P(B) is marginal, i.e., the probability of the evidence. When
Bayes’ theorem is applied to classify text documents, the class variable c of a particular
document d is given by:
c = argmax_c P(c|d) = argmax_c [ P(d|c) P(c) / P(d) ]    (Bayes' rule)
Let the feature conditional probabilities P(x_i | c) be independent of each other (conditional independence assumption). So,

P(x_1, x_2, …, x_n | c) = P(x_1 | c) × P(x_2 | c) × … × P(x_n | c)
Now, if we consider words as the features of the document, the individual feature
conditional probabilities can be calculated using the following formula:
P̂(c_j) = doccount(C = c_j) / N_doc

P̂(w_i | c_j) = count(w_i, c_j) / Σ_{w∈V} count(w, c_j)

i.e., the fraction of times word w_i appears among all words in documents of topic c_j. But what if a given word w_i does not occur in any training document of class c_j but appears in a test document? P(w_i | c_j) will become 0, which means the probability of the test document belonging to class c_j will become 0. To avoid this, Laplace smoothing is introduced, and the conditional feature probabilities are calculated in the following way:

p̂(w_i | c) = (count(w_i, c) + 1) / Σ_{w∈V} (count(w, c) + 1) = (count(w_i, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)
where |V| is the number of unique words in the text corpus. This way we can easily deal
with unseen test words.
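A minimal sketch of this formulation with scikit-learn, where MultinomialNB with alpha=1.0 applies exactly this add-one (Laplace) smoothing over unigram counts; the training texts and labels are illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["loved this video", "very helpful tutorial",
               "clickbait and disappointing", "waste of time"]
train_labels = ["positive", "positive", "negative", "negative"]

# Unigram counts feed the multinomial model; alpha=1.0 is Laplace (add-one) smoothing.
vectorizer = CountVectorizer(ngram_range=(1, 1))
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB(alpha=1.0)
clf.fit(X_train, train_labels)

# "disappointing" never occurs in a positive training document; add-one smoothing
# keeps P(word|class) non-zero for every vocabulary word, so no posterior collapses to 0.
X_new = vectorizer.transform(["clickbait and disappointing video"])
print(clf.predict(X_new))        # expected: ['negative']
print(clf.predict_proba(X_new))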
Logistic regression predicts the probability that a given input belongs to a certain class. It
uses a logistic (sigmoid) function to map any real-valued number into the range [0, 1].
The logistic function is defined as:
σ(z) = 1 / (1 + e^(−z))
where z = wᵀx + b. The output of the logistic function, σ(z), represents the probability that the input belongs to the positive class.
Mathematical Framework

The logistic regression model estimates the probability of the positive class as:

P(y = 1 | x) = σ(wᵀx + b)

The model parameters w and b are learned by maximizing the likelihood function:

L(w, b) = ∏_{i=1}^{N} P(y_i | x_i)^{y_i} (1 − P(y_i | x_i))^{1−y_i}

log L(w, b) = Σ_{i=1}^{N} [ y_i log σ(wᵀx_i + b) + (1 − y_i) log(1 − σ(wᵀx_i + b)) ]
We then use optimization techniques like gradient descent to find the optimal w and b.
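A compact sketch of logistic regression on TF-IDF features with scikit-learn; max_iter=1000 mirrors the 1000 iterations mentioned in the implementation plan, while the texts and labels below are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["loved this video", "very helpful and clear",
         "clickbait and disappointing", "terrible waste of time"]
labels = [1, 1, 0, 0]   # 1 = positive, 0 = negative

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# The solver maximizes the log-likelihood shown above (equivalently, minimizes log loss).
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

X_new = vectorizer.transform(["helpful and clear video"])
print(clf.predict(X_new))          # expected: [1]
print(clf.predict_proba(X_new))    # sigmoid outputs: P(negative), P(positive)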
Metrics

• Accuracy: Measures the proportion of correctly classified instances among all instances.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision: Measures the proportion of correctly predicted positive instances among the instances predicted as positive.

Precision = TP / (TP + FP)

• Recall: Measures the proportion of correctly predicted positive instances among all actual positive instances.

Recall = TP / (TP + FN)

• F1-Score: Harmonic mean of precision and recall; provides a balance between the two metrics.

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Confusion Matrix

These metrics, together with the confusion matrix, help in understanding the performance of the model and the classification of sentiments.
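All of these quantities can be computed directly from predicted and true labels; the sketch below uses scikit-learn with illustrative label vectors.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)

y_true = ["positive", "negative", "positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "positive", "positive"]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, pos_label="positive"))
print("Recall   :", recall_score(y_true, y_pred, pos_label="positive"))
print("F1-score :", f1_score(y_true, y_pred, pos_label="positive"))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["negative", "positive"]))
print(classification_report(y_true, y_pred))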
3.8 Presentation of Output

Positive vs. Negative: Output can be presented as a binary classification, where each text is labeled as either positive or negative sentiment.
Visualizations: We will use charts or graphs to visualize the distribution of sentiment across different texts or categories.
3.9 Implementation Plan

1. Project Overview:
The project is implemented in Python, chosen for its robust libraries and frameworks for
natural language processing (NLP) and machine learning.
- Hardware:
A standard development environment with at least 4GB of RAM and 2 CPU cores is used
to efficiently handle data processing and model training.
- Network:
A stable internet connection is required to access the YouTube Data API for collecting
comments and to download necessary libraries.
- Data Collection:
Video comments are collected using the YouTube Data API, supplemented with additional
datasets like Kaggle’s sentiment datasets and self-classified sarcasm words (2500
sarcasm). Data collection involves making API requests and managing the response data
for further processing.
- Text Preprocessing:
Comments are cleaned, tokenized, stop-word filtered, and lemmatized as described in Section 3.4 before being passed to the models.
- Model Training:
The Naive Bayes Classifier and Logistic Regression models are used for sentiment
classification. Features like Unigrams and Multinomial Naive Bayes are employed to train
the models on the preprocessed data. Logistic Regression is trained using 1000 iterations
to optimize performance.
- Integration:
- Trained models are integrated into a backend system that processes new comments in
real-time. This system classifies comments based on their sentiment, providing actionable
insights.
- Deployment:
- The system is deployed locally on a localhost environment for testing and demonstration
purposes. This setup allows for efficient development and testing without the need for
cloud platforms.
- Secure management of API keys and user data is ensured during deployment.
- Testing:
- Unit tests are conducted for individual components, and integration tests are performed
to ensure that the entire system functions as expected.
- Validation:
- Cross-validation techniques are used to assess model accuracy. The dataset is split into 80% for training and 20% for testing. Logistic Regression is validated using 100-fold cross-validation to ensure model robustness and reliability (a small sketch of this step follows the list below).
- The models are periodically updated with new data to maintain classification accuracy.
Continuous monitoring of system performance and user feedback is conducted to address
any issues and improve functionality.
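As referenced in the Validation item above, the sketch below illustrates the 80/20 split and k-fold cross-validation with scikit-learn; the placeholder corpus, the fold count (4 instead of the 100 folds mentioned above, because the example data is tiny), and the random seed are all illustrative assumptions.

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder data standing in for the labeled comment corpus.
texts = ["loved it", "great video", "so helpful", "nice work",
         "clickbait", "disappointing", "waste of time", "terrible audio"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = TfidfVectorizer().fit_transform(texts)
y = labels

# 80/20 train/test split as described in the implementation plan.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))

# k-fold cross-validation on the full set.
scores = cross_val_score(model, X, y, cv=4)
print("Cross-validation accuracy:", scores.mean())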
In summary, this project implements a robust sentiment analysis system using Python,
leveraging the Naive Bayes and Logistic Regression classifiers. The system is deployed
locally and provides detailed insights into YouTube video comments, enabling content
creators to understand and respond to viewer feedback effectively.
3.10 Problems Faced

Sarcasm and Irony: Detecting sarcasm and irony remains a significant challenge. These
forms of expression often invert the literal meaning of words, making it difficult for models
to correctly infer the intended sentiment.
Mixed Sentiments: Comments may contain mixed sentiments, expressing both positive and
negative opinions. Accurately identifying and categorizing these sentiments requires
sophisticated models capable of understanding context and nuance.
Data Imbalance: The distribution of sentiments in YouTube comments can be skewed, with
a higher prevalence of neutral or positive comments. This imbalance can affect model
performance and lead to biased predictions.
Chapter 4 RESULT AND ANALYSIS
4.1 Results
The bar chart displays the average term frequency (TF) of the top 20 words in a document.
The bar chart displays the average term frequency-inverse document frequency (TF-IDF) of the top 20 words in a document.
From the output for the given document, we can conclude that the document bears a positive sentiment.
4.2 Performance Metrics
The figure above shows the performance metrics and confusion matrix for a binary
classification model. The total dataset size is 1.6 million, and the model's accuracy lies
between 75% and 91%, indicating relatively good performance.
Classification Report

1. Precision
- Negative: 0.75
- Positive: 0.78

2. Recall
- Negative: 0.79
- Positive: 0.74

3. F1-Score
- Negative: 0.77
- Positive: 0.76
These values indicate the balance between precision and recall; the model is slightly better at identifying the "negative" class but has comparable performance for both.
Confusion Matrix
Chapter 5 CONCLUSION
In this sentiment analysis project, we utilized a Naive Bayes classifier to evaluate and
predict sentiments within our dataset. Our model achieved an accuracy of above 75%,
demonstrating its effectiveness in accurately classifying sentiments. A comparative
analysis between our manual implementation and scikit-learn's built-in model revealed that
scikit-learn’s pre-built functionalities provided superior performance, due to its optimized
algorithms and efficient preprocessing capabilities.
We encountered some challenges related to the use of unigrams for feature extraction.
Specifically, the unigram model treated common stop words like "not" as irrelevant, which
led to a potential loss of context when such words were paired with positive or negative
terms. This issue highlighted the limitation of unigrams in capturing the nuanced sentiment
conveyed by multi-word expressions. As a result, incorporating bigram and trigram models could potentially offer better accuracy by capturing more of the context and relationships between words.
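A brief sketch of the bigram idea using scikit-learn's ngram_range parameter, which keeps a phrase such as "not good" as a single feature; the comments below are illustrative, and the project itself used unigrams.

from sklearn.feature_extraction.text import CountVectorizer

comments = ["not good at all", "very good video"]

# Unigrams only: "not" and "good" become independent features.
unigram = CountVectorizer(ngram_range=(1, 1))
unigram.fit(comments)
print(sorted(unigram.vocabulary_))

# Unigrams + bigrams: "not good" survives as a single feature carrying the negation.
bigram = CountVectorizer(ngram_range=(1, 2))
bigram.fit(comments)
print(sorted(bigram.vocabulary_))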
For model persistence, we used pickle files (.pkl) to save and load our trained models.
Pickle is a Python module that serializes and deserializes Python objects, allowing us to
save the state of our model and reuse it without retraining. This approach facilitated
efficient model management and deployment.
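A minimal sketch of this save/load cycle with pickle; the file names and the tiny training set are illustrative, not the project's actual artifacts.

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train a small model, then serialize both the vectorizer and the classifier.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(["loved it", "terrible video"])
model = MultinomialNB().fit(X, ["positive", "negative"])

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# Later (e.g. in the Streamlit app) the objects are restored without retraining.
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
with open("vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)

print(loaded_model.predict(loaded_vectorizer.transform(["loved this"])))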
Overall, the project was a success. We utilized datasets from Kaggle and manual collection
to ensure a comprehensive evaluation. The combination of theoretical exploration and
practical application has deepened our understanding of sentiment analysis and
demonstrated the value of leveraging advanced tools and techniques for robust results.
REFERENCES
[4] S. B. S. PP, ""Sentimental analysis using Naive Bayes classifier"," no. 2019 Mar 30, 2019 .
[5] M. S. L. T. a. T. N. Garcia, "Sentiment Analysis and its Correlation with Engagement Metrics
on YouTube Cooking Recipe Videos," Content Engagement, vol. vol. 20, pp. pp. 512-528,
2023.
APPENDIX
Text pre-processing:
Categorize
import pandas as pd
from textblob import TextBlob   # needed for the polarity-based labelling below

df = pd.read_csv('word_count.csv')
print(df.columns)

def get_sentiment(word):
    # Label each word by the sign of its TextBlob polarity score.
    analysis = TextBlob(str(word))
    if analysis.sentiment.polarity > 0:
        return 'positive'
    elif analysis.sentiment.polarity < 0:
        return 'negative'
    else:
        return 'neutral'

df['Sentiment'] = df['word'].apply(get_sentiment)
df.to_csv('final.csv', index=False)
Comment Collection

import os
import pandas as pd
from urllib.parse import urlparse, parse_qs
from googleapiclient.discovery import build   # google-api-python-client

API_KEY = os.getenv("YOUTUBE_API")

def get_comments(video_id):
    # Build the YouTube Data API client and page through all top-level comments.
    youtube = build('youtube', 'v3', developerKey=API_KEY)
    comments = []
    request = youtube.commentThreads().list(
        part="snippet",
        videoId=video_id,
        maxResults=100
    )
    response = request.execute()
    while True:
        for item in response['items']:
            comment = item['snippet']['topLevelComment']['snippet']
            author = comment['authorDisplayName']
            text = comment['textDisplay']
            comments.append([author, text])
        if 'nextPageToken' in response:
            request = youtube.commentThreads().list(
                part="snippet",
                videoId=video_id,
                pageToken=response['nextPageToken'],
                maxResults=100
            )
            response = request.execute()
        else:
            break
    return comments

def save_to_csv(comments, filename):
    # Column names assumed; the cleaning script reads the 'Comment' column.
    df = pd.DataFrame(comments, columns=['Author', 'Comment'])
    df.to_csv(filename, index=False)

def get_video_id(url):
    # Extract the video id from a standard watch URL or a youtu.be short link.
    parsed_url = urlparse(url)
    query_params = parse_qs(parsed_url.query)
    video_id = query_params.get("v")
    if video_id:
        return video_id[0]
    if parsed_url.netloc == "youtu.be":
        return parsed_url.path[1:]
    return None

video_url = input("Enter the YouTube video URL: ")   # URL supplied at run time
video_id = get_video_id(video_url)
comments = get_comments(video_id)
save_to_csv(comments, 'data.csv')
Data Cleaning
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')

stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Remove URLs, then punctuation, and normalise to lowercase.
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    text = re.sub(r'[^\w\s]', '', text)   # remove punctuation
    return text.lower()

def get_wordnet_pos(treebank_tag):
    # Map Penn Treebank POS tags to WordNet constants for the lemmatizer.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def remove_stop_words(tokens):
    return [word for word in tokens if word not in stop_words]

def lemmatize_word(word):
    lemmatizer = WordNetLemmatizer()
    pos = get_wordnet_pos(nltk.pos_tag([word])[0][1])
    return lemmatizer.lemmatize(word, pos)

df = pd.read_csv('data.csv')
df['Cleaned_Comment'] = df['Comment'].astype(str).apply(clean_text)   # 'Comment' column name assumed
df['Tokens'] = df['Cleaned_Comment'].apply(lambda x: remove_stop_words(x.split()))

# Flatten all lemmatized tokens and count word frequencies for later labelling.
words = [lemmatize_word(word) for tokens in df['Tokens'] for word in tokens]
word_count = pd.Series(words).value_counts().reset_index()
word_count.columns = ['word', 'count']   # column names assumed to match the later scripts
word_count.to_csv('word_count.csv', index=False)
Using Bayes
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

df = pd.read_csv('final.csv')
text_data = df['word'].astype(str)   # labelled words; the original fragment read df['text']
sentiments = df['Sentiment']

# Train/test split, TF-IDF vectorization, and Multinomial Naive Bayes in one pipeline.
X_train, X_test, y_train, y_test = train_test_split(text_data, sentiments,
                                                    test_size=0.2, random_state=42)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))

# Predict a label for every word, weighted by how often it appeared in the comments.
predictions = []
for _, row in df.iterrows():
    sentiment = model.predict([str(row['word'])])[0]
    predictions.extend([sentiment] * row['count'])
sentiment_counts = pd.Series(predictions).value_counts()

def overall_sentiment(counts):
    # Aggregation structure assumed from the original fragment.
    if counts.get('positive', 0) > counts.get('negative', 0):
        return 'positive'
    elif counts.get('negative', 0) > counts.get('positive', 0):
        return 'negative'
    else:
        return 'neutral'

print(overall_sentiment(sentiment_counts))
Frontend: Loading the Model and Displaying the Result

import pickle
import streamlit as st
import matplotlib.pyplot as plt
import nltk
from urllib.parse import urlparse, parse_qs
from googleapiclient.discovery import build
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from PIL import Image

# NLTK resources needed by preprocess_text (downloaded once).
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def get_video_id(url):
    # Support watch, youtu.be, embed, and /v/ style URLs.
    query = urlparse(url)
    if query.hostname == 'youtu.be':
        return query.path[1:]
    if query.path == '/watch':
        return parse_qs(query.query)['v'][0]
    if query.path[:7] == '/embed/':
        return query.path.split('/')[2]
    if query.path[:3] == '/v/':
        return query.path.split('/')[2]
    return None

def get_comments(video_id, max_results=500):
    # Page through top-level comments with the YouTube Data API.
    youtube = build('youtube', 'v3', developerKey=API_KEY)
    comments = []
    response = youtube.commentThreads().list(
        part='snippet',
        videoId=video_id,
        maxResults=min(max_results, 100),
        textFormat='plainText'
    ).execute()
    while response:
        for item in response['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comments.append(comment)
        if 'nextPageToken' in response and len(comments) < max_results:
            response = youtube.commentThreads().list(
                part='snippet',
                videoId=video_id,
                pageToken=response['nextPageToken'],
                maxResults=min(max_results - len(comments), 100),
                textFormat='plainText'
            ).execute()
        else:
            break
    return comments

def load_model():
    # Load the pickled classifier and vectorizer saved during training (file names assumed).
    with open('model.pkl', 'rb') as f:
        model = pickle.load(f)
    with open('vectorizer.pkl', 'rb') as f:
        vectorizer = pickle.load(f)
    return model, vectorizer

def preprocess_text(text):
    # Lowercase, tokenize, drop stopwords and punctuation, and lemmatize each comment.
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())
    filtered_tokens = [lemmatizer.lemmatize(word) for word in tokens
                       if word.isalnum() and word not in stop_words]
    return ' '.join(filtered_tokens)

API_KEY = 'YOUR_YOUTUBE_API_KEY'   # the hard-coded key in the original listing is redacted here
cmnt_model, td_vectorize = load_model()

st.title('YouTube Comment Sentiment Analysis')          # page layout assumed
youtube_url = st.text_input('Enter a YouTube video URL')

if st.button('Analyze Comments'):
    if youtube_url:
        video_id = get_video_id(youtube_url)
        if video_id:
            comments = get_comments(video_id)
            if comments:
                processed_comments = [preprocess_text(c) for c in comments]
                transformed_comments = td_vectorize.transform(processed_comments)
                predictions_proba = cmnt_model.predict_proba(transformed_comments)

                # Probability column 1 is assumed to correspond to the "positive" class.
                positive_count = sum(1 for p in predictions_proba if p[1] > 0.5)
                positive_percentage = (positive_count / len(predictions_proba)) * 100
                negative_percentage = 100 - positive_percentage

                # Overall label; the 60/40 cut-offs are assumptions, not from the original listing.
                sentiment = ""
                if positive_percentage > 60:
                    sentiment = "Positive"
                elif positive_percentage < 40:
                    sentiment = "Negative"
                else:
                    sentiment = "Neutral"

                col1, col2 = st.columns(2)               # two-column layout assumed
                col1.subheader(f'Overall sentiment: {sentiment}')
                if sentiment == "Positive":
                    image_path = 'assets/1.png'
                elif sentiment == "Negative":
                    image_path = 'assets/2.png'
                else:
                    image_path = 'assets/3.png'
                image = Image.open(image_path)
                col2.image(image, width=250)

                # Bar chart of the positive/negative split (chart type inferred from the axis labels).
                fig, ax = plt.subplots()
                fig.patch.set_facecolor('#f0f0f0')
                ax.bar(['Positive', 'Negative'],
                       [positive_percentage, negative_percentage],
                       color=['#a6d189', '#e78284'])
                ax.set_ylim(0, 100)
                ax.set_ylabel('Percentage (%)')
                ax.set_title('Overall Sentiment')
                st.pyplot(fig)

                # Pie chart of the same distribution.
                fig, ax = plt.subplots()
                fig.patch.set_facecolor('#f0f0f0')
                ax.pie([positive_percentage, negative_percentage],
                       labels=['Positive', 'Negative'], autopct='%1.1f%%',
                       colors=['#a6d189', '#e78284'])
                ax.set_title('Sentiment Distribution')
                st.pyplot(fig)
            else:
                st.warning('No comments were found for this video.')
        else:
            st.error('Could not extract a video id from the URL.')
    else:
        st.error('Please enter a YouTube video URL.')