Kartik-20CS46 Report


A SEMINAR REPORT

On
SENTIMENT ANALYSIS : OPINION MINING

Submitted in partial fulfillment of the requirements
for the award of the degree of
Bachelor of Technology
in
Computer Science Engineering

(Bikaner Technical University)

Session : 2023-24
SUBMITTED TO:
Mr. H.R. Choudhary
Department of CSE

SUBMITTED BY:
Kartik Yadav
20EEACS047
7th Semester

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

ENGINEERING COLLEGE AJMER

CERTIFICATE

This is to certify that the Seminar Report entitled Sentiment
Analysis: Opinion Mining has been submitted by Kartik Yadav in
partial fulfillment of the degree of B.Tech. in Computer Science &
Engineering for the academic session 2023-2024.

He has undergone the requisite work as prescribed by Bikaner
Technical University, Bikaner (Rajasthan).

Mr. H.R. Choudhary (Supervisor)
Dr. Prakriti Trivedi (Seminar Coordinator)
Dr. Jyoti Gajrani (HOD)

Place: Ajmer

Date: 29 November 2023

ACKNOWLEDGEMENT

I take this opportunity to express my heartfelt thanks to the
people who were part of this seminar in numerous ways and who
gave me unending support right from the beginning of this
seminar.

I want to give sincere thanks to the Principal Dr. Rekha Mehra for
her valuable support.

I extend my thanks to Dr. Jyoti Gajrani, Head of the Department,
for her constant support.

I express my deep sense of gratitude to my guide, Mr. H.R.
Choudhary, for his continuous cooperation and encouragement.

Kartik Yadav

20EEACS047

TABLE OF CONTENTS

CHAPTER NO. TITLE

CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
1 INTRODUCTION
1.1 Overview of Sentiment Analysis
1.2 Motivation to Work
1.3 Problem Statement
1.4 Goal of Report
1.5 Need of Sentiment Analysis
2 LITERATURE OVERVIEW
2.1 Existing Methods
3 DESIGN
3.1 Structural Design
3.2 UML Design
4 IMPLEMENTATION
4.1 Software Requirements
4.2 Hardware Requirements
4.3 Sample Code
4.4 Input Output
5 CONCLUSION & FUTURE WORK
6 REFERENCES

LIST OF FIGURES

FIGURE NO. TITLE


1.3.2.1.1 Extraction of Data
1.3.2.2.1 Architectural Design Using Naive Bayes
1.3.2.2.2 Naive Bayes
1.2.2.2.3 Architectural Design
3.1.1 Structural Chart
3.2.1.1 Use Case Diagram
3.2.2.1 Sequence Diagram
3.2.2.2 Sequence Diagram
3.2.3.1 Activity Diagram
3.2.4.1 Collaboration Diagram
4.3.1 Result for Sentiment Analysis of Twitter Data

ABSTRACT

It is challenging to understand the latest trends and summarize the state of
general opinion about products because of the diversity and size of social media
data, and this creates the need for automated, real-time opinion extraction and
mining. Opinion mining, or sentiment analysis, is treated as a difficult text
classification task that uses natural language processing to systematically
identify, extract, quantify and study affective states and subjective
information. In this report, we explore the role of text pre-processing in
sentiment analysis and report on experimental results which demonstrate that,
with appropriate feature selection and representation, sentiment analysis
accuracies using support vector machines (SVM) may be significantly improved.
Detecting sentiment in figurative and metaphorical language is considered a
much harder problem in the literature, and this is an area where we expect to
see significant work in the near future.

The goal of this report is to classify review data into sentiments (positive or
negative) by applying different supervised machine learning classifiers to data
collected for different Indian political parties, and to show which political
party is viewed most favorably by the public. We also identify which classifier
gives the highest accuracy during classification.

CHAPTER 1 INTRODUCTION

1.1 Overview of Sentiment Analysis

Sentiment is an attitude, thought or judgment prompted by feeling. Sentiment
analysis, also known as opinion mining, is the study of people's sentiments or
emotions towards an entity. Many people now share their reviews and opinions
about products and services online. These opinions have become part of the
decision-making process and have an impact on business models. Understanding and
acting on reviews also helps a business gain the trust of its customers, which
in turn helps it expand. We therefore need to study sentiment analysis of
customer reviews.

Sentiment analysis provides some answers about what the most important issues
are, at least from the perspective of customers. Because sentiment analysis can
be automated, decisions can be made based on a significant amount of data rather
than on plain intuition, which is not always right.

Emotion analysis is the process of identifying human emotions, most typically
from facial expressions as well as from verbal expressions. It relies on a deeper
analysis of human emotions and sensitivities. The technology, also referred to as
emotional analytics, provides insights into how a customer perceives a product,
the presentation of a product or their interactions with a customer service
representative.

Sentiment analysis can be performed at the document, sentence and phrase levels.
At the document level, a summary of the entire document is taken first and then
analyzed to determine whether the sentiment is positive, negative or neutral. At
the phrase level, the phrases in a sentence are analyzed to check the polarity.
At the sentence level, each sentence is classified into a particular class to
provide the sentiment. Sentiment analysis has various applications. It is used
to infer people's opinions on social media by analyzing the feelings or thoughts
they express in the form of text. Sentiment analysis is domain centered, i.e.
results from one domain cannot be applied directly to other domains. It is used
in many real-life scenarios: to gather reviews of products or movies, to analyze
the financial reports of a company, and for predictions or marketing.

Using machine learning techniques and natural language processing, we can
extract the subjective information of a document and try to classify it
according to its polarity, such as positive, neutral or negative. This is a
useful analysis, since it can help determine the overall opinion about a product
on sale, or help predict the stock price of a given company: if most people
think positively about the company, its stock price may well rise.

Effective business strategies can be built from results of sentiment and emotion
analysis. Identifying clear emotions will establish a transparent meaning of text which
potentially develops customer relationships, motivation and extends consumer
expectations towards a brand or service or a product.

Just as with other data related to customer experience, emotion data is used to
create strategies that improve a business's customer relationship management
(CRM). Sentiment analysis software can be used alongside a company's data
collection, data classification, data analytics and data visualization
initiatives to uncover the hidden sentiment in text. This helps the company
identify areas for improvement, as well as the changes that can help it grow
its business while keeping customers satisfied.

Twitter is a microblogging platform where anyone can read or write short
messages, which are called tweets. The amount of data accumulated on Twitter is
huge. This data is unstructured and written in natural language. Twitter
sentiment analysis is the process of accessing tweets on a particular topic and
predicting the sentiment of these tweets as positive, negative or neutral with
the help of different machine learning algorithms.

The basic steps for sentiment determination are as follows:

● Firstly, evaluative terms expressing opinions must be extracted from the review.
● Secondly, the semantic orientation (SO), or polarity, of the opinions must be
determined.
● Thirdly, the opinion strength, or intensity, of each opinion should also be
determined.
● Finally, the review is classified with respect to sentiment classes, such as
Positive and Negative, based on the SO of the opinions it contains.
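
The four steps above can be sketched with a very small hand-made lexicon. The lexicon entries, their scores and the function name classify_review are illustrative assumptions rather than part of the report's system. The sign of a score plays the role of the SO, its magnitude plays the role of the opinion strength, and a Neutral class is added here for reviews whose scores cancel out.

```python
# Toy opinion lexicon: sign = semantic orientation (SO), magnitude = strength.
# The words and scores are illustrative assumptions, not a published lexicon.
LEXICON = {"good": 1.0, "great": 2.0, "excellent": 3.0,
           "bad": -1.0, "poor": -2.0, "terrible": -3.0}


def classify_review(review):
    """Return (label, score) for a review using the toy lexicon."""
    words = [w.strip(".,!?") for w in review.lower().split()]
    # Step 1: extract the evaluative terms.
    terms = [w for w in words if w in LEXICON]
    # Steps 2 and 3: polarity and strength are both read from the lexicon score.
    score = sum(LEXICON[t] for t in terms)
    # Step 4: classify the review from the aggregated SO.
    if score > 0:
        label = "Positive"
    elif score < 0:
        label = "Negative"
    else:
        label = "Neutral"
    return label, score


print(classify_review("Great screen but poor battery, terrible support."))
```

Summing the scores is only one possible aggregation; averaging or weighting by position would fit the same four-step outline.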

1.2 Motivation to Work

Businesses depend primarily on customer satisfaction and on customer reviews of
their products. Shifts in sentiment on social media have been shown to correlate
with shifts in stock markets. Identifying customer grievances and resolving them
leads to customer satisfaction and builds the trustworthiness of an
organization. Hence there is a need for an unbiased, automated system to
classify customer reviews regarding any problem.

In today's environment, where we are suffering from data overload (although this
does not mean better or deeper insights), companies might have mountains of
customer feedback collected. For mere humans, however, it is still impossible to
analyze it manually without some sort of error or bias.

Oftentimes, companies with the best intentions find themselves in an insights
vacuum. They know they need insights to inform their decision making and they
know that they are lacking them, but they do not know how best to get them.
Sentiment analysis provides some answers about what the most important issues
are, at least from the perspective of customers. Because sentiment analysis can
be automated, decisions can be made based on a significant amount of data rather
than on plain intuition, which is not always right.

1.3 Problem Statement

The problem is to generate statistical information about emotions and sentiments
by analyzing users' opinions in tweets, which can then be used to infer how
users feel and thereby improve their experience. Despite the availability of
software to extract data regarding a person's sentiment on a specific product or
service, organizations and other data workers still face issues regarding the
data extraction. With the rapid growth of the World Wide Web, people are using
social media such as Twitter, which generates large volumes of opinion text in
the form of tweets that are available for sentiment analysis. From a human
viewpoint this is a huge volume of information, which makes it difficult to
extract the relevant sentences, read them, analyze them tweet by tweet,
summarize them and organize them into an understandable format in a timely
manner.

1.3.1 Objective
The main objective of this report is to perform sentiment analysis on text from
social media platforms, so that people's opinions about products, services,
policies etc. are extracted from these online platforms. To achieve this
objective, we build a classifier based on supervised learning and perform live
sentiment analysis on data collected about different political parties.

1.3.2 Methodology

The sentiment analysis of Twitter data is an emerging field that needs much more
attention. We use Tweepy, a Python library for the Twitter API, to stream live
tweets from Twitter. The user chooses a keyword of interest, and tweets
containing that keyword are collected and stored in a CSV file. We then turn
this into a labeled dataset by using TextBlob and setting the sentiment fields
accordingly. This gives us the training data set before preprocessing. Next, we
perform preprocessing to clean the tweets and remove unwanted text and
characters. We then train our classifier by fitting it to the training data,
after which it makes predictions on an unseen test data set, which gives us the
accuracy with which the classifier predicted the outcomes. Finally, we present
our results in pictorial form, which is the best way to showcase results because
the information is easy to understand.
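
The labeling step in this pipeline can be sketched as follows. The sketch assumes that each tweet already has a polarity score in [-1, 1] (the range TextBlob reports for polarity); the cut-off at zero, the string labels and the function names are assumptions made for illustration, and the polarity values in the example are made up rather than computed. The column names follow those used in the sample code of Section 4.3.

```python
import csv
import io


def label_from_polarity(polarity):
    """Map a polarity score in [-1, 1] to a sentiment label (assumed cut-off at 0)."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def write_labeled_rows(scored_tweets, handle):
    """Write (text, polarity) pairs as labeled CSV rows."""
    writer = csv.writer(handle)
    writer.writerow(["ItemId", "Sentiment", "SentimentText"])
    for item_id, (text, polarity) in enumerate(scored_tweets, start=1):
        writer.writerow([item_id, label_from_polarity(polarity), text])


# Hypothetical polarity scores standing in for a scorer's output.
buffer = io.StringIO()
write_labeled_rows([("love the new truck", 0.5), ("worst launch ever", -1.0)], buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example self-contained; in practice the handle would be an open CSV file.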

1.3.2.1 Proposed System:

Extraction of Data
Tweets matching a keyword chosen by the user are collected using Tweepy, a
popular Python library for the Twitter API, and stored in a CSV file. The data
set collected for sentiment analysis contains tweets based on a keyword, e.g.,
cybertruck. A dataset of tweets expressing various emotions, downloaded from
Kaggle, is used for emotion analysis. Since both models are trained using
supervised learning and work on different parameters, different data sets are
used.
To extract opinions, data is first selected and extracted from Twitter in the
form of tweets. After the data set of tweets was selected, the tweets were
cleaned of emoticons and unnecessary punctuation marks, and a database was
created to store the data in a specific transformed structure. In this
structure, all the transformed tweets are in lowercase and are divided into
their constituent parts, each stored in a specific field. The steps adopted for
this transformation are described in the next subsections.
Fig: 1.3.2.1.1 Extraction of Data

Processing of Data :

Removing HTML tags and URLs:
HTML tags and URLs carry little or no sentiment, so they are removed from the
tweets using regular expressions.
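
A minimal sketch of this cleaning step with Python's re module is shown below. The two patterns and the function name strip_html_and_urls are assumptions chosen for illustration; the report does not state the exact expressions it uses.

```python
import re

# Assumed patterns: anything between angle brackets, and http(s)/www links.
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def strip_html_and_urls(tweet):
    """Remove HTML tags and URLs, then collapse leftover whitespace."""
    no_tags = TAG_RE.sub(" ", tweet)
    no_urls = URL_RE.sub(" ", no_tags)
    return " ".join(no_urls.split())


print(strip_html_and_urls("Loving the <b>cybertruck</b> https://t.co/abc123 so much"))
```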

Conversion to lowercase:
To maintain uniformity, all the tweets are converted to lowercase. This helps
avoid inconsistency in the data. Python strings provide a method called lower()
to convert text to lowercase.

Tokenization:
Tokenization is the process of splitting text into tokens before transforming it
into vectors, for example a document into paragraphs, or sentences into words.
Working with tokens also makes it easier to filter out unnecessary ones. In this
case we tokenize the reviews into words.
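
As an illustration of lowercasing followed by word-level tokenization, the sketch below uses a simple regular expression as a stand-in for a library tokenizer. The pattern and the function name tokenize are assumptions, and a library tokenizer may split contractions and punctuation differently.

```python
import re

# Assumed token pattern: runs of lowercase letters, digits and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(review):
    """Lowercase a review and split it into word tokens."""
    return TOKEN_RE.findall(review.lower())


print(tokenize("The Battery life is GREAT, isn't it?"))
```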

Removing punctuation and special symbols:
Apart from the considered set of emoticons, punctuation marks and symbols such
as &, \ and ; are removed.
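
One way to sketch this step is to strip punctuation token by token while leaving listed emoticons untouched. The three emoticons below are a small subset taken from the happy-emoticon set in the sample code; the function name remove_punctuation and the whitespace-based splitting are assumptions.

```python
import string

# A small subset of emoticons to preserve; the report's full set is longer.
KEEP_EMOTICONS = {":-)", ":)", ":D"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(tweet):
    """Drop punctuation and symbols from each token, keeping listed emoticons."""
    cleaned = []
    for token in tweet.split():
        if token in KEEP_EMOTICONS:
            cleaned.append(token)
            continue
        stripped = token.translate(PUNCT_TABLE)
        if stripped:
            cleaned.append(stripped)
    return " ".join(cleaned)


print(remove_punctuation("great truck & price; :) !!"))
```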

Stemming and Lemmatization:
Sentences are written in different tenses and in singular and plural forms, so
many words carry suffixes such as -ing, -ed, -es and -ies. Extracting the root
word is therefore sufficient to identify the sentiment behind the text.

Stemming and lemmatization reduce the inflectional and derivational forms of a
word to a common base form.
Example: cats is reduced to cat, and ponies is reduced to poni.
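
The idea of suffix stripping can be illustrated with a deliberately tiny rule-based stemmer. This is a toy sketch and not the Porter or Snowball algorithm that libraries such as NLTK provide; the rule list, the minimum stem length and the name toy_stem are assumptions.

```python
# Ordered (suffix, replacement) rules; longer suffixes are tried first.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]


def toy_stem(word):
    """Strip one common inflectional suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)] + replacement
        if word.endswith(suffix) and len(stem) >= 3:
            return stem
    return word


print([toy_stem(w) for w in ["cats", "ponies", "walking", "walked"]])
```

A real stemmer applies many more rules and conditions, so this sketch will over-strip or under-strip words that a library stemmer handles correctly.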

Feature Extraction:
Text data needs special preparation before a model can be trained on it. After
tokenization, words are encoded as integers or floating-point values so that
they can be fed to machine learning algorithms. This practice is known as
vectorization or feature extraction. The scikit-learn library offers a TF-IDF
vectorizer that converts text to weighted word-frequency vectors.
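
To make the weighting concrete, the sketch below computes TF-IDF by hand using the textbook definitions tf = count / document length and idf = ln(N / document frequency). It is not scikit-learn's implementation, which applies smoothing and normalization by default and therefore produces different numbers; the function name tf_idf and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / document length and idf = ln(N / document frequency),
    the textbook form; library implementations usually add smoothing.
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [["good", "phone"], ["bad", "phone"]]
for vector in tf_idf(docs):
    print(vector)
```

A word such as "phone" that occurs in every document receives weight zero, which is the intended effect: it does not help to tell the documents apart.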

Fitting Data to Classifier and predicting test data:
After feature extraction, the training data is fitted to a suitable classifier.
Once the classifier is trained, we use it to predict the labels of the test
data and then compare the original labels with the labels returned by the
classifier.

Result Analysis:
Here the accuracies of the different classifiers are shown, and the classifier
with the highest accuracy is chosen. Other measures such as the F-score, mean
and variance are also taken into account when comparing the classifiers.
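
The comparison of original and predicted labels can be sketched as below for one class of interest. The function name evaluate and the example labels are assumptions; scikit-learn provides equivalent metric functions, but this version uses only the standard library.

```python
def evaluate(y_true, y_pred, positive="positive"):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


actual = ["positive", "negative", "positive", "negative"]
predicted = ["positive", "positive", "positive", "negative"]
print(evaluate(actual, predicted))
```

Accuracy alone can be misleading when the classes are unbalanced, which is why the F-score is reported alongside it.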

Visual Representation:
Our final results are plotted as pie charts whose segments are positive,
negative and neutral in the case of sentiment analysis, and happy, sad, joy
etc. in the case of emotion analysis. A pictorial representation is chosen
because it conveys information with little effort from the reader.
1.3.2.2 System Architecture:

Fig:1.3.2.2.1 Architecture diagram for sentiment analysis using Naive Bayes

Naive Bayes Algorithm:
The Naive Bayes algorithm is based on the well-known Bayes' theorem, which is
mathematically represented as

P(A|B) = [P(B|A) * P(A)] / P(B)

where
A and B are events,
P(A|B) is the probability of event A given that event B has occurred, known as
the posterior probability,
P(A) is the probability of event A being true, known as the prior probability,
P(B|A) is the probability of event B given that A is true, known as the
likelihood, and
P(B) is the probability of event B, known as the evidence.

Naive Bayes is a classification method that relies on Bayes' theorem with strong
(naive) independence assumptions between the features. A Naive Bayes classifier
assumes that the presence of a particular feature in a class is unrelated to the
presence of any other feature. For instance, a fruit might be considered to be
an apple if it is red, round and approximately three inches in diameter.
Regardless of whether these features depend on one another or on the presence of
other features, a Naive Bayes classifier considers each of them to contribute
independently to the probability that the fruit is an apple. Along with its
simplicity, Naive Bayes is known to be able to outperform even highly
sophisticated classification methods. Bayes' theorem provides a way of computing
the posterior probability p(a|b) from p(a), p(b) and p(b|a) as follows:
p(a|b) = [p(b|a) * p(a)] / p(b)
where p(a|b) is the posterior probability of class a given predictor b, and
p(b|a) is the likelihood, that is, the probability of predictor b given class a.
The prior probability of class a is denoted p(a), and the prior probability of
predictor b is denoted p(b).
Naive Bayes is widely used for classifying texts into multiple classes and has
been applied to sentiment classification.
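
A compact sketch of a multinomial Naive Bayes text classifier built directly from the formula is given below. It works with log probabilities, uses add-one (Laplace) smoothing for unseen words, and drops p(b) because it is the same for every class. The class name NaiveBayesText, the smoothing choice and the four training examples are assumptions for illustration and are not the report's implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log p(a): prior probability of the class.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokens:
                # log p(b|a) per word, treated as independent given the class.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [["good", "great", "phone"], ["love", "great", "service"],
              ["bad", "poor", "phone"], ["hate", "poor", "service"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayesText().fit(train_docs, train_labels)
print(model.predict(["great", "phone"]), model.predict(["poor", "service"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.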

Fig: 1.3.2.2.2 Naive Bayes Classifier

Fig: 1.2.2.2.3 Architecture diagram for Emotion analysis using Support
Vector Machines

1.4 Goal of Report
With the emergence of social networking, many websites have evolved in the past
decade, such as Twitter, Facebook and Tumblr. Twitter is one website that is
widely used all over the world. According to Twitter, around 200 billion tweets
are posted every year. Twitter allows people to express their thoughts,
feelings, emotions, opinions, reviews, etc. about any topic in natural language
within 140 characters. Python is a high-level programming language that is well
suited to NLP. For processing natural language data, we use one of its
libraries, the Natural Language Toolkit (NLTK). NLTK provides a large number of
corpora that help in training classifiers, and it supports common NLP tasks such
as tokenizing, part-of-speech tagging, stemming, lemmatizing, parsing and
performing sentiment analysis on given datasets.
Dealing with a large dataset is a challenging task, but with NLTK we can readily
classify our data and compare the accuracy of different classifiers.
The goal of this report is to perform sentiment analysis on reviews from social
media platforms and surveys. Public opinions about different political parties
are mined from Twitter and then classified as positive or negative using
supervised machine learning classifiers. These results tell us about people's
reviews and opinions of these political parties.

To achieve this goal, a module is created that can perform live sentiment
analysis. In live sentiment analysis, users can follow the trend of any
currently trending topic, depicted by two sentiment categories (positive and
negative) in live graphs. The accuracy and reliability of the module can further
be checked with the help of various machine learning classifiers.
In this report we work on different political parties because politics plays a
vital role in our country. How a party wins an election is different from how
that party works after winning.

1.5 Need Of Sentiment Analysis

1.5.1 Industry Evolution


Industry needs only the useful portion of the data rather than the complete set
of unstructured data. Sentiment analysis is useful for extracting from the data
the important features that industry actually needs. It gives industries a great
opportunity to deliver value to their audience and to gain value for themselves.
Any business-to-consumer industry can benefit from this, whether it is
restaurants, entertainment, hospitality, mobile services, retail or travel.

1.5.2 Decision Making


People store information on blogs, web applications, social media and social
websites. To obtain relevant information from all of this, a method is needed
that can analyze the data and return useful results. It is very difficult for
companies to conduct surveys on a regular basis, so there is a need to analyze
this data and identify the best products based on users' opinions, reviews and
advice. Reviews and opinions also help people make important decisions in
research and business.

CHAPTER 2 LITERATURE REVIEW

"What other people think” has always been an important piece of information for
most of us during the decision-making process. The Internet and the Web have now
(among other things) made it possible to find out about the opinions and
experiences of those in the vast pool of people that are neither our personal
acquaintances nor well-known professional critics, that is, people we have never
heard of. Conversely, more and more people are making their opinions available
to strangers via the Internet. The interest that individual users show in online
opinions about products and services, and the potential influence such opinions
wield, is a driving force for this area of interest. There are many challenges
involved in this process that need to be overcome in order to obtain reliable
outcomes.

B. Pang, L. Lee, S. Vaithyanathan et al [8]

They were among the first to work on sentiment analysis. Their main aim was to
classify text by overall sentiment, not just by topic, e.g., classifying movie
reviews as either positive or negative. They apply machine learning algorithms
to movie review databases and find that these algorithms outperform
human-produced baselines. The machine learning algorithms they use are Naïve
Bayes, maximum entropy and support vector machines. After examining various
factors, they also conclude that classification of sentiment is very
challenging. They show that supervised machine learning algorithms are a basis
for sentiment analysis.

E. Loper, S. Bird et al [10]


The Natural Language Toolkit (NLTK) is a library that consists of many program
modules, large sets of structured files, various tutorials, problem sets, many
statistics functions, ready-to-use machine learning classifiers, computational
linguistics courseware, etc. The main purpose of NLTK is to carry out natural
language processing, i.e. to perform analysis on human language data. NLTK
provides corpora that are used for training classifiers. Developers can replace
existing components with new ones, which allows more structured programs to be
built and more sophisticated results to be obtained from a dataset.

O. Almatrafi, S. Parack, B. Chavan et al [12]
They are the researchers who proposed a system based on location. According to
them, Sentiment Analysis is carried out by Natural Language Processing (NLP) and
machine learning algorithms to extract a sentiment from a text unit which is from a
particular location. They study various applications of location based sentiment
analysis by using a data source in which data can be extracted from different
locations easily. In Twitter, there is a field of tweet location which can easily be
accessed by a script and hence data (tweets) from particular locations can be collected
for identifying trends and patterns. In their research they work on Indian general
elections 2014. They perform mining on 600,000 tweets which were collected over a
period of 7 days for two political parties. They apply a supervised machine learning
approach, like Naïve-Bayes algorithm, to build a classifier which can classify the
tweets in either positive or negative. They identify the thoughts and opinions of users
towards these two political parties in different locations and they plot their findings
on India map by using a Python library.

B. Sun, V. Ng, et al [16]


Many efforts have been made to gather information from social networks in order
to perform sentiment analysis on internet users. Their aim is to show how
sentiment influences social network posts, and they also compare the results for
various topics on different social media platforms. A large amount of data is
generated every day, and people are also very curious about finding other
similar people. Many researchers measure the influence of a post through the
number of likes and replies it receives, but they cannot tell whether that
influence on other posts is positive or negative. Their research raises several
questions and proposes new methodologies for measuring the sentimental influence
of a post.

2.2 Existing Methods


2.2.1 Using a Heterogeneous Dataset for Emotion Analysis in Text :
A supervised machine learning approach was adopted to recognize six basic emotions
(anger, disgust, fear, happiness, sadness and surprise) using a heterogeneous
emotion-annotated dataset which combines news headlines, fairy tales and blogs. For
this purpose, different feature sets, such as bags of words, and N-grams, were used.
The Support Vector Machines classifier (SVM) performed significantly better than
other classifiers, and it generalized well on unseen examples.
Five datasets were considered in order to compare the various approaches. In the
bag-of-words (BOW) approach, each sentence in the dataset is represented by a
feature vector composed of Boolean attributes, one for each word in the
vocabulary. If a word occurs in a given sentence, its corresponding attribute is
set to 1; otherwise it is set to 0. In the N-gram approach, features are defined
as sequences of words of length n. N-grams can be used for catching syntactic
patterns in text and may include important text features such as negations,
e.g., "not happy". Negation is an important feature for the analysis of emotion
in text because it can totally change the expressed emotion of a sentence. The
authors note that some research studies in sentiment analysis claim that N-gram
features improve performance beyond the BOW approach.
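
Both representations can be sketched in a few lines. The function names boolean_bow and ngrams and the five-word vocabulary are assumptions for illustration. The example shows why bigrams matter for negation: the pair ("not", "happy") is kept as a single feature.

```python
def boolean_bow(sentence_tokens, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the sentence."""
    present = set(sentence_tokens)
    return [1 if word in present else 0 for word in vocabulary]


def ngrams(tokens, n):
    """Return all contiguous word sequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


vocabulary = ["am", "happy", "i", "not", "sad"]
tokens = ["i", "am", "not", "happy"]
print(boolean_bow(tokens, vocabulary))
print(ngrams(tokens, 2))
```

In the Boolean vector, "happy" is marked as present even though the sentence negates it, whereas the bigram list preserves the negation.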

2.2.2 Multiclass Emotional Analysis on Social Media Posts:


The authors report that, among the models they built, the SVM achieved the
highest accuracy. Using around 13,000 examples per emotion, they split the data
into a training set (63%), a hold-out cross-validation set (27%) and a final
test set (10%). The first model trained and optimized for the task was
Multinomial Naive Bayes. The model gave very good results, even as a baseline:
whereas a random classifier would have performed with 6.6% accuracy on 15
classes, this Naive Bayes model achieved 28.01% accuracy. After Multinomial
Naive Bayes, they trained and optimized Softmax Regression models. While
optimizing the Softmax Regression model, several checks on pre-processing
techniques were conducted. One of the conclusions drawn was that stemming
actually decreased the accuracy of the models. The next model trained was a
linear SVM, trained with both tf-idf vectors and count vectors. To reduce the
number of features and prevent overfitting, PCA was tried on the document
vectors.
Later, they tried training a kernel SVM with an RBF kernel; however, the size of
the data made it impossible for a regular computer to train the model in
reasonable time. They therefore trained a ν-SVM with the kernel trick instead.
The results showed that the SVM with a linear kernel achieved the highest
accuracy.
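
The three-way split and the chance baseline mentioned above can be reproduced with a short sketch. The function name three_way_split, the fixed random seed and the use of 1,000 placeholder examples are assumptions; the cited work does not describe how its split was implemented.

```python
import random


def three_way_split(examples, train=0.63, validation=0.27, seed=0):
    """Shuffle and split into train, validation and test portions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


train_set, val_set, test_set = three_way_split(range(1000))
print(len(train_set), len(val_set), len(test_set))

# Chance accuracy (in percent) for a uniform random guess over 15 classes.
print(round(100 / 15, 2))
```

Keeping the test portion untouched until the end is what makes the final accuracy an honest estimate on unseen data.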

2.2.3 Classification of Emotions from text using SVM based Opinion Mining :
SVM classification using quadratic programming was used. The steps included
preparing the data set, annotating it with predefined emotions using NLP,
preparing the database matrix of test emotions and training emotions,
classifying the training set with a support vector machine using a quadratic
programming algorithm, computing the prediction of the support vector machine
using a kernel function and its parameters, and finally computing the accuracy
of the classification. The basic idea of SVM is to find the optimal hyperplane
that separates two classes of pre-classified data with the largest margin. Once
this hyperplane is determined, it is used to classify data into two classes
according to the side on which they lie. By applying appropriate transformations
to the data space before computing the separating hyperplane, SVM can be
extended to cases where the boundary between the two classes is non-linear.
Finally, on classifying the data set, superior results were obtained.
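
How a trained linear SVM uses its hyperplane can be sketched as follows. The weight vector w and bias b are hypothetical, already-trained parameters (the quadratic programming step that finds them is not shown), and the function names are assumptions. For a hyperplane in canonical form, the margin width is 2 / ||w||.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign class +1 or -1 from the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Width of the margin, 2 / ||w||, for a hyperplane in canonical form."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical already-trained parameters for a 2-D problem.
w, b = [3.0, 4.0], -5.0
print(classify(w, b, [2.0, 1.0]), classify(w, b, [0.0, 0.0]), margin_width(w))
```

Maximizing the margin is equivalent to minimizing ||w||, which is why training an SVM is posed as a quadratic programming problem.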

2.2.4 Emotion Detection and Analysis on Social Media:


In this work, two approaches were followed: NLP and machine learning. For the
machine learning approach, the author explains that the training set was
generated from a set of seed words composed of commonly used emotion words and
emoticons from the EWS, evenly distributed over all the emotion categories.
Tweepy was then queried using these seed words to build a large database of
around 13,000 tweets. The seed words are used to ensure that the collected
tweets express at least one of the six emotions. Filtering and labeling of the
tweets was then performed to eliminate tweets that convey mixed emotions. Only
those tweets which have a percentage of more than 70% for a particular emotion
are labeled and fed to the classifier for training.
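
The 70% filtering rule can be sketched as below. How the per-emotion percentages are obtained is not described here, so the sketch assumes each tweet comes with a dictionary of non-negative emotion scores and takes the share of the largest score; the function name dominant_emotion and the example scores are hypothetical.

```python
def dominant_emotion(scores, threshold=0.70):
    """Return the emotion whose share of the total score exceeds threshold, else None."""
    total = sum(scores.values())
    if total == 0:
        return None
    emotion, value = max(scores.items(), key=lambda item: item[1])
    return emotion if value / total > threshold else None


# Hypothetical per-emotion scores for two tweets.
tweets = {
    "so happy today!!": {"happiness": 8, "surprise": 2},
    "not sure how I feel": {"sadness": 5, "fear": 5},
}
labeled = {}
for text, scores in tweets.items():
    emotion = dominant_emotion(scores)
    if emotion is not None:
        labeled[text] = emotion
print(labeled)
```

Tweets with mixed emotions return None and are left out of the training set, which matches the purpose of the filter.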
For training the classifier, the open-source library Weka was used. Before
classification, pre-processing steps such as stop-word filtering, lowercasing
all words and stemming each word with Weka's Snowball Stemmer were applied. The
results of the NLP approach and the ML approach are then combined. The scores
generated by the first approach are modified according to the category assigned
by the classifier. The emotion category with the maximum final score is chosen
as the final emotion category of the tweet (or the piece of text).

CHAPTER 3 DESIGN

3.1 Structural Design:

Fig: 3.1.1 Structural Chart

3.2 UML Design:

A UML diagram is a partial graphical representation (view) of a model of a system
under design, implementation, or already in existence. The UML diagram contains
graphical elements (symbols) - UML nodes connected with edges (also known as
paths or flows) - that represent elements in the UML model of the designed system.
The UML model of the system might also contain other documentation such as use
cases written as templated texts.
The kind of the diagram is defined by the primary graphical symbols shown on the
diagram. For example, a diagram where the primary symbols in the contents area are
classes is a class diagram. A diagram which shows use cases and actors is a use case
diagram. A sequence diagram shows sequence of message exchanges between
lifelines.
UML specification does not preclude mixing of different kinds of diagrams, e.g. to
combine structural and behavioral elements to show a state machine nested inside a
use case. Consequently, the boundaries between the various kinds of diagrams are not
strictly enforced. At the same time, some UML Tools do restrict the set of available
graphical elements which could be used when working on a specific type of diagram.
UML specification defines two major kinds of UML diagrams: structure diagrams
and behavior diagrams.
Structure diagrams show the static structure of the system and its parts on different
abstraction and implementation levels and how they are related to each other. The
elements in a structure diagram represent the meaningful concepts of a system, and
may include abstract, real world and implementation concepts.
Behavior diagrams show the dynamic behavior of the objects in a system, which can
be described as a series of changes to the system over time.

3.2.1 Use Case Diagram :

Fig: 3.2.1.1 Use Case Diagram

3.2.2 Sequence Diagram:

Fig: 3.2.2.1 Sequence Diagram for Sentiment Analysis (I)

Fig: 3.2.2.2 Sequence Diagram for Sentiment Analysis (II)

3.2.3 Activity Diagram:

Fig: 3.2.3.1 Activity Diagram

3.2.4 Collaboration Diagram:

Fig: 3.2.4.1 Collaboration Diagram

CHAPTER 4 IMPLEMENTATION

4.1 Software Requirements
The following software and modules need to be installed for successful execution of
the project:
● Anaconda
● Spyder
● Jupyter Notebook
● NLTK
● Scikit-learn
● Matplotlib
● Tweepy
● Pandas
● NumPy
● TextBlob
● vaderSentiment
● csv
● re (regular expressions)
● Windows

4.2 Hardware Requirements

The following hardware is recommended for faster execution of the code:
● A minimum of an Intel Core i3 processor
● A minimum of 4 GB RAM
● A CPU with at least 2 cores and a clock speed above 1.5 GHz
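
Before running the sample code, it can help to check which of the listed Python modules are actually importable. The snippet below is a small sketch of such a check using only the standard library; the REQUIRED list uses the import names assumed for the packages above (for example sklearn for Scikit-learn), and missing_modules is a hypothetical helper name.

```python
import importlib.util

# Import names assumed for the packages listed in Section 4.1
# (an import name can differ from the name used to install the package).
REQUIRED = ["nltk", "sklearn", "matplotlib", "tweepy", "pandas", "numpy",
            "textblob", "vaderSentiment", "csv", "re"]


def missing_modules(names):
    """Return the modules from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
print("Missing modules:", missing if missing else "none")
```

The check only looks the modules up and does not import them, so it stays fast even when large packages are installed.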

4.3 Sample Code


import tweepy
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener  # available in tweepy 3.x only
import json
import pandas as pd
import csv
import re  # regular expression
# from textblob import TextBlob
import string
import preprocessor as p
import os
import nltk
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

ItemId = 0
Sentiment = 0
pos = 0
neg = 0

consumer_key = 'cksnY9Jp0jpry0WDtqT7kidyb'
consumer_secret = '3R3UHvxHma9Q9DTQmZLrsCpGb9fgLuZZkqYK8BK2mKu8qo1iO9'
access_key = '1169646368903663617-VvzdXw2FtF2ZACfsc2Ww8bTDPpaeji'
access_secret = 'x1w3fVEb497b0Ftf2hyWOfngwB4YNsiFyg16nbYatlKPR'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

analyzer = SentimentIntensityAnalyzer()
hacking_tweets = r"C:\Users\saile\OneDrive\Desktop\cyber.csv"
COLS = ['ItemId', 'Sentiment', 'SentimentText']
start_date = '2019-10-01'
end_date = '2019-10-31'

# Happy Emoticons
emoticons_happy = set([
':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
'=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b',
'>:)', '>;)', '>:-)', '<3'
])

# Sad Emoticons
emoticons_sad = set([
':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<',
":'-(", ":'(", ':\\', ':-c', ':c', ':{', '>:\\', ';('
])

# Emoji patterns (kept active, because clean_tweets() uses emoji_pattern)
emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002702-\U000027B0"
    u"\U000024C2-\U0001F251"
    "]+", flags=re.UNICODE)
# combine sad and happy emoticons
emoticons = emoticons_happy.union(emoticons_sad)

# method clean_tweets()
def clean_tweets(tweet):
    stop_words = set(stopwords.words('english'))
    # remove links first, while the '://' part of the URL is still intact
    tweet = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', tweet)
    # after tweepy preprocessing a colon can remain after removing mentions
    # or the RT sign at the beginning of the tweet
    tweet = re.sub(r':', '', tweet)
    # tweet = re.sub(r'…', '', tweet)
    # remove consecutive non-ASCII characters
    tweet = re.sub(r'[^\x00-\x7F]+', '', tweet)
    # remove hashtags
    tweet = re.sub(r'#([A-Za-z0-9_]+)', '', tweet)
    # remove emojis from tweet
    tweet = emoji_pattern.sub(r'', tweet)
    # tokenize the cleaned text, so that the substitutions above take effect
    word_tokens = word_tokenize(tweet)
    filtered_tweet = []
    # looping through conditions
    for w in word_tokens:
        # check tokens against stop words, emoticons and punctuation
        if w not in stop_words and w not in emoticons and w not in string.punctuation:
            filtered_tweet.append(w)
    return ' '.join(filtered_tweet)

def write_tweets(keyword, file, ItemId, pos, neg):
    # If the file exists, then read the existing data from the CSV file.
    if os.path.exists(file):
        df = pd.read_csv(file, header=0)
    else:
        df = pd.DataFrame(columns=COLS)
    # page attribute in tweepy.Cursor and iteration
    # (api.search is the tweepy 3.x name of the search endpoint)
    for page in tweepy.Cursor(api.search, q=keyword, count=200,
                              include_rts=False, since=start_date).pages(10):
        for status in page:
            new_entry = []
            status = status._json
            if status['lang'] != 'en':
                continue
            vs = analyzer.polarity_scores(status['text'])
            if vs['compound'] >= 0.5:
                Sentiment = 1
                pos = pos + 1
            else:
                Sentiment = 0
                neg = neg + 1
            ItemId = ItemId + 1
            # new entry append
            new_entry += [ItemId, Sentiment, status['text']]
            # to append original author of the tweet
            single_tweet_df = pd.DataFrame([new_entry], columns=COLS)
            # pd.concat is used because DataFrame.append was removed in pandas 2.0
            df = pd.concat([df, single_tweet_df], ignore_index=True)
    df.to_csv(file, mode='w', columns=COLS, index=False, encoding="utf-8")
    # return the counters: integers passed as arguments are not updated in place
    return ItemId, pos, neg

# declare keywords as a query for three categories
cyber_words = "#cybertruck -filter:retweets"

# call main method passing keywords and file path
ItemId, pos, neg = write_tweets(cyber_words, hacking_tweets, ItemId, pos, neg)
# C:\Users\saile\OneDrive\Desktop\cyber1.csv
import sys
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

train_data = pd.read_csv(r'C:\Users\saile\OneDrive\Desktop\cyber.csv')
rand_indexs = np.random.randint(1, len(train_data), 50).tolist()
train_data["SentimentText"][rand_indexs]
tweets_text = train_data.SentimentText.str.cat()
emos = set(re.findall(r" ([xX:;][-']?.) ", tweets_text))
emos_count = []
for emo in emos:
    emos_count.append((tweets_text.count(emo), emo))
sorted(emos_count, reverse=True)
HAPPY_EMO = r" ([xX;:]-?[dD)]|:-?[\)]|[;:][pP]) "
SAD_EMO = r" (:'?[/|\(]) "

print("Happy emoticons:", set(re.findall(HAPPY_EMO, tweets_text)))
print("Sad emoticons:", set(re.findall(SAD_EMO, tweets_text)))
# Uncomment this line if you haven't downloaded punkt before
# or just run it as it is and uncomment it if you got an error.
#nltk.download('punkt')
def most_used_words(text):
    tokens = word_tokenize(text)
    frequency_dist = nltk.FreqDist(tokens)
    print("There are %d different words" % len(set(tokens)))
    return sorted(frequency_dist, key=frequency_dist.__getitem__, reverse=True)

most_used_words(train_data.SentimentText.str.cat())[:100]
# In[11]:
mw = most_used_words(train_data.SentimentText.str.cat())
most_words = []
for w in mw:
    if len(most_words) == 1000:
        break
    if w in stopwords.words("english"):
        continue
    else:
        most_words.append(w)

sorted(most_words)
# In[12]:
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

def stem_tokenize(text):
    # use the Snowball stemmer here; the lemmatizer has its own function below
    stemmer = SnowballStemmer("english")
    return [stemmer.stem(token) for token in word_tokenize(text)]

def lemmatize_tokenize(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text)]

# In[13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline

class TextPreProc(BaseEstimator, TransformerMixin):
    def __init__(self, use_mention=False):
        self.use_mention = use_mention

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # regex=True is passed explicitly, because recent pandas versions
        # treat the pattern as a literal string by default
        if self.use_mention:
            X = X.str.replace(r"@[a-zA-Z0-9_]* ", " @tags ", regex=True)
        else:
            X = X.str.replace(r"@[a-zA-Z0-9_]* ", "", regex=True)
        X = X.str.replace("#", "", regex=False)
        X = X.str.replace(r"[-\.\n]", "", regex=True)
        X = X.str.replace(r"&\w+;", "", regex=True)
        # Removing links
        X = X.str.replace(r"https?://\S*", "", regex=True)
        # replace repeated letters with only two occurrences
        # heeeelllloooo => heelloo
        X = X.str.replace(r"(.)\1+", r"\1\1", regex=True)
        # mark emoticons as happy or sad
        X = X.str.replace(HAPPY_EMO, " happyemoticons ", regex=True)
        X = X.str.replace(SAD_EMO, " sademoticons ", regex=True)
        X = X.str.lower()
        return X

# In[14]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn import tree

sentiments = train_data['Sentiment']
tweets = train_data['SentimentText']

vectorizer = TfidfVectorizer(tokenizer=lemmatize_tokenize, ngram_range=(1,2))
pipeline = Pipeline([
    ('text_pre_processing', TextPreProc(use_mention=True)),
    ('vectorizer', vectorizer),
])
learn_data, test_data, sentiments_learning, sentiments_test = train_test_split(
    tweets, sentiments, test_size=0.3)
learning_data = pipeline.fit_transform(learn_data)
lr = LogisticRegression()
bnb = BernoulliNB()
mnb = MultinomialNB()
clf = tree.DecisionTreeClassifier()
clf1 = RandomForestClassifier(n_estimators=10)
models = {
    'logistic regression': lr,
    'bernoulliNB': bnb,
    'multinomialNB': mnb,
}
for model in models.keys():
    scores = cross_val_score(models[model], learning_data, sentiments_learning,
                             scoring="f1", cv=10)
    print("===", model, "===")
    print("scores = ", scores)
    print("mean = ", scores.mean())
    print("variance = ", scores.var())
    models[model].fit(learning_data, sentiments_learning)
    print("score on the learning data (accuracy) = ",
          accuracy_score(sentiments_learning, models[model].predict(learning_data)))
    print("")

# In[9]:
from sklearn.model_selection import GridSearchCV
grid_search_pipeline = Pipeline([
    ('text_pre_processing', TextPreProc()),
    ('vectorizer', TfidfVectorizer()),
    ('model', MultinomialNB()),
])
# parameter names follow the <step name>__<parameter> convention (double underscore)
params = [
    {
        'text_pre_processing__use_mention': [True, False],
        'vectorizer__max_features': [1000, 2000, 5000, 10000, 20000, None],
        'vectorizer__ngram_range': [(1,1), (1,2)],
    },
]
grid_search = GridSearchCV(grid_search_pipeline, params, cv=5, scoring='f1')
grid_search.fit(learn_data, sentiments_learning)
print(grid_search.best_params_)

# In[9]:
mnb.fit(learning_data, sentiments_learning)

# In[15]:
testing_data = pipeline.transform(test_data)
mnb.score(testing_data, sentiments_test)

# In[12]:
# Data to plot
pos = 124
neg = 166
labels = 'Positive', 'Negative'
sizes = [pos, neg]
colors = ['gold', 'lightcoral']
explode = (0.1, 0)  # explode 1st slice
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()

import pandas as pd
import numpy as np
import nltk
import re
import itertools
import time
import matplotlib.pyplot as plt
import os

start_time = time.time()
data = pd.read_csv(r'C:\Users\saile\OneDrive\Desktop\project\text_emotion.csv')
sizes1 = []
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

def cleaning(text):
    txt = str(text)
    # remove links
    txt = re.sub(r"http\S+", "", txt)
    if len(txt) == 0:
        return 'no text'
    # remove every @mention (and never drop an ordinary first word)
    txt = [w for w in txt.split() if not w.startswith('@')]
    if len(txt) == 0:
        return 'no text'
    txt = " ".join(txt)
    # replace non-word characters with spaces
    txt = re.sub(r'[^\w]', ' ', txt)
    # keep at most two consecutive occurrences of any character
    txt = ''.join(''.join(s)[:2] for _, s in itertools.groupby(txt))
    txt = txt.replace("'", "")
    txt = nltk.tokenize.word_tokenize(txt)
    # lemmatize every token as a verb
    txt = [lem.lemmatize(w, "v") for w in txt]
    if len(txt) == 0:
        return 'no text'
    return txt
data['content'] = data['content'].map(cleaning)
data = data.reset_index(drop=True)
# join each token list back into one space-separated string for the vectorizer;
# the 'no text' placeholder is already a string and is kept unchanged
data['content'] = data['content'].map(lambda t: t if isinstance(t, str) else ' '.join(t))
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn import svm
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    data.content, data.sentiment, test_size=0.25, random_state=0)
x_train = x_train.reset_index(drop=True)
x_test = x_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
vectorizer = TfidfVectorizer(min_df=3, max_df=0.9)
train_vectors = vectorizer.fit_transform(x_train)
test_vectors = vectorizer.transform(x_test)
model = svm.SVC(kernel='linear')
model.fit(train_vectors, y_train)
predicted_sentiment = model.predict(test_vectors)
report = classification_report(y_test, predicted_sentiment, output_dict=True)
df = pd.DataFrame(report).transpose()
df.to_csv('classification_report.csv', index=False)
sizes = df['support'].tolist()
# the first 13 rows of the report are assumed to be the 13 emotion classes
for i in range(13):
    sizes1.append(int(sizes[i]))
labels = ['Anger', 'Boredom', 'Empty', 'Enthusiasm', 'Fun', 'Happiness', 'Hate',
          'Love', 'Neutral', 'Relief', 'Sadness', 'Surprise', 'Worry']
colors = ['Red', 'yellowgreen', 'lightcoral', 'orange', 'gold', 'purple', 'black',
          'pink', 'brown', 'green', 'blue', 'maroon', 'bluegreen']
plt.pie(sizes1, explode=None, labels=labels, colors=None,
        autopct='%1.1f%%', shadow=True, startangle=None)
plt.axis('equal')
plt.show()
predicted_sentiments = []
for s in range(len(predicted_sentiment)):
    predicted_sentiments.append(predicted_sentiment[s])
prediction_df = pd.DataFrame({'Content': x_test,
                              'Emotion_predicted': predicted_sentiment,
                              'Emotion_actual': y_test})
prediction_df.to_csv('emotion_recognizer_svm.csv', index=False)
elapsed_time = time.time() - start_time
print("processing time:", elapsed_time, "seconds")
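
The labelling rule used in write_tweets above (a tweet counts as positive when its VADER compound score is at least 0.5, otherwise as negative) can be tried out without Twitter access. The sketch below isolates only that rule; the compound scores are made-up values, and label_from_compound is a hypothetical helper name that does not appear in the sample code.

```python
def label_from_compound(compound, threshold=0.5):
    """Map a compound score in [-1, 1] to 1 (positive) or 0 (negative)."""
    return 1 if compound >= threshold else 0


# Made-up compound scores for illustration.
scores = [0.82, 0.5, 0.10, -0.64]
labels = [label_from_compound(s) for s in scores]
pos = sum(labels)
neg = len(labels) - pos
print(labels, pos, neg)  # [1, 1, 0, 0] 2 2
```

Because every score below the threshold is counted as negative, neutral and mildly positive tweets also land in the negative class; a three-way split would need a second threshold.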

4.4 Input Output


Table 4.4.1 Input data

Reviews | Class
foolish, idiotic and boring it's so laddish and youngish , only teenagers could find it funny | Negative
the rock is destined to be the 21st century's new conan and that he's going to make a splash even greater than arnold schwarzenegger | Positive
Barry Sonnenfeld owes frank the pug big time the biggest problem with roger avary's uproar against the map | Negative
the seaside splendor and shallow , beautiful people are nice to look at while you wait for the story to get going | Positive

Table 4.4.2 Sample Tweet and Processed Tweet

Tweet Type | Result
Original Tweet | @xyz I think Kejriwal is a habitual liar, even where he don’t needs to lie he tells a lie >#AAP
Processed Tweet | think, habit, lie, even, don’t, need, tell, angry

Table 4.4.3 Sample Cleaned Data

Raw Data | Clean Data
@jackstenhouse69 I really liked it, in my opinion it def is :) | Really, liked, opinion, def
:( \u201c@EW: How awful. Police: Driver kills 2, injures 23 at #SXSW http://t.co/8GmFiOuZbS\u201d | Sad, awful, police, driver, kills, injures
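
The kind of cleaning shown in these tables can be approximated with a few regular expressions. The following is a simplified sketch using only the standard library, not the report's exact pipeline: STOP_WORDS is a tiny illustrative list (the sample code uses NLTK's English stop words), clean_tweet is a hypothetical helper name, and the emoticon-to-word mapping seen in the table (for example ":(" becoming "Sad") is not reproduced.

```python
import re

# A tiny stop-word list for illustration only.
STOP_WORDS = {"i", "it", "in", "my", "is", "a", "the", "at", "how"}


def clean_tweet(text):
    """Lowercase a tweet and return its remaining content words."""
    text = re.sub(r"https?://\S+", " ", text)   # drop links
    text = re.sub(r"@\w+", " ", text)           # drop @mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # drop digits, punctuation, emoticons
    return [w for w in text.lower().split() if w not in STOP_WORDS]


print(clean_tweet("@jackstenhouse69 I really liked it, in my opinion it def is :)"))
# ['really', 'liked', 'opinion', 'def']
```

Links are removed before punctuation so that the URL is still one unbroken token when its pattern is applied.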

Output
Below are the results of the sentiment and emotional analysis, presented to the user as a
pie chart drawn with matplotlib.

Fig: 4.3.1 Result for Sentiment Analysis of Twitter Data
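
The percentages printed on such a pie chart are each class's share of the total. As a worked example, the short sketch below recomputes the shares for the counts pos=124 and neg=166 quoted in the plotting code of Section 4.3; it only repeats the arithmetic and does not draw the chart.

```python
pos, neg = 124, 166  # counts quoted in the plotting code of Section 4.3

total = pos + neg
shares = {"Positive": 100 * pos / total, "Negative": 100 * neg / total}
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
# Positive: 42.8%
# Negative: 57.2%
```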

CHAPTER 5 CONCLUSION & FUTURE WORK

We presented results for sentiment and emotional analysis on Twitter data. On
applying Logistic Regression, Bernoulli Naive Bayes and Multinomial Naive Bayes
for sentiment analysis, Multinomial Naive Bayes stands out with 96.4% accuracy at
a test split of 0.3. The user's topic of interest is taken as the input for sentiment
analysis, so that users can see the sentiment statistics behind a topic of their own
interest. We conclude that implementing sentiment analysis and emotional
analysis using these algorithms helps in a deeper understanding of textual data,
which can serve as a potential platform for businesses.

In future work, we aim to handle emoticons and dive deeper into emotional analysis to
further detect idiomatic statements. We will also explore richer linguistic analysis
such as parsing and semantic analysis.

Some of the future scopes that can be included in our research work are:
● A parser can be embedded into the system to improve results.
● A web-based application can be built for our work in future.
● We can improve our system so that it can deal with sentences that have multiple
meanings.
● We can also increase the number of classification categories so that we can get better results.
● We can start work on multiple languages like Hindi, Spanish, and Arabic to
provide sentiment analysis to more local users.

