
An Interim Report on
Pre-Release & Post-Release Sentiment Analysis of Upcoming Movies

Submitted By
Group No. 8 Batch: APR- 2019 Location: Bengaluru

Group Members
Saurav Suman – BABAPR19053
Neha Tiwary – BABAPR19057
Divya Thomas – BABAPR19018
Anurag Kedia – BABAPR19011
Peehu – BABAPR19071

Research Supervisor
Mr. Deepak Sharma

Great Lakes Institute of Management

Contents

1. Introduction
2. Scope, Objective & Problem Statement
3. Data Source and Description
   a) Data Source
   b) Data Description
4. Data Pre-processing
   a) Data Cleaning
   b) Creation of Word Corpuses
   c) Extraction and Tokenization
   d) DTM (Document Term Matrix)
5. Exploratory Data Analysis
   a) Visualization of retweets
   b) Plot of most frequent words in the text
   c) Word Cloud of the keywords of the tweets
   d) Word Cloud for the account from which most retweets originate
   e) Word Cloud for the location from where most of the tweets originate
6. Modelling Approach
   a. Techniques and software to be used
   b. Model Building
   c. Comparison chart of accuracy & AUC values of all models for the pre-release dataset
   d. Comparison chart of accuracy & AUC values of all models for the post-release dataset
7. Recommendations & Applications
8. Challenges and Limitations
9. References and Bibliography
10. Appendix

1. Introduction:
Movies are one of the most convenient ways to entertain people, yet of the many movies the industry produces each year, only a few achieve high success and ratings. Movie revenue depends on several components, such as the cast acting in the movie, the production budget, critics' reviews, audience ratings and the release year. Because of these multiple components, there is no simple formula for predicting how much revenue a particular movie will generate.
However, by analysing the revenues generated by previous movies, a model can be built which can
help us predict the expected revenue for a particular movie.
In today's world, movies are one of the biggest sources of entertainment as well as a major business. If the success rate of a movie can be predicted accurately, stakeholders can earn higher profits from it, and if the prediction indicates a low success rate, they can improve the content of the movie to increase its revenue. Models and mechanisms that predict the success of a movie therefore help the business significantly: stakeholders such as actors, producers, directors and event companies can use these predictions to make more informed decisions before the movie is released.
Social media platforms such as Twitter, YouTube and Facebook are used daily by millions of people to share content and comments on all kinds of subjects. Businesses therefore have a strong interest in tapping into these huge data sources to extract information that can improve their decision-making.
For example, predictive models derived from social media data may help filmmakers make more profitable decisions. Movies are among the topics of greatest interest to social media users across all classes of people.

2. Scope, Objective & Problem Statement


The widespread usage of the internet has enabled people to share their views with the rest of the world online, and this way of broadcasting opinions has gained considerable popularity. However, it has also led to a decrease in the quality of the opinions shared, which makes it challenging for people to browse through all of them.
Opinions that require a text description, such as reviews of a movie, its acting, direction or songs, are much less prone to incorrect or invalid responses. Predicting people's sentiment towards a movie from tweets can help estimate the success ratio of the movie. The data is collected only from the well-known social media platform Twitter.
People are very active on Twitter and share their views from the moment a movie poster is released until the movie is running in theatres.
Movie reviews have been used for sentiment analysis before, and we expect tweets to express the same range of opinions and subjectivity as movie reviews.

Sentiment analysis aims to uncover the attitude of the person on a particular topic from the written
text. Other terms used to denote this research area include “opinion mining” and “subjectivity
detection”. It uses natural language processing and machine learning techniques to find statistical
and/or linguistic patterns in the text that reveal attitudes.
It has gained popularity in recent years due to its immediate applicability in the business environment, such as summarizing feedback from product reviews, discovering collaborative recommendations, or assisting in election campaigns. The focus of our project is the analysis of sentiment in short website comments.
We expect a short comment to express a person's opinion on a movie succinctly and directly. We focus on two important properties of text: 1. subjectivity – whether the style of the sentence is subjective or objective; 2. polarity – whether the person expresses a positive or negative opinion. We use statistical methods to capture the elements of subjective style and sentence polarity.
Statistical analysis is done at the sentence level, and we apply machine learning techniques to classify sets of messages. We are interested in the following questions:
1. To what extent can we extract the subjectivity and polarity from the short comments? What are
the important features that can be extracted from the raw text that have the greatest influence on the
classification?
2. What machine learning techniques are suitable for this purpose?

Problem Statement 1
 Based on the Twitter data, what are the sentiments of the people pre and post movie release?

Objective 1:
Identify the hashtags related to the movie that help gather more tweets about the movie to be analysed for sentiment.

Problem Statement 2
 Movie rating categorization based on the polarity of the tweets.

Objective 2:
Using the polarity score, categorize tweets into rating levels such as bad, average, good and excellent for any movie before its release, and predict the overall rating of the movie based on the Twitter data.

After the release of the movie, we collected a fresh set of tweets and rebuilt the model to check whether it gives the same result, which helps us conclude whether people's expectations before the movie release are similar to the sentiment analysis after the release.
 To meet our scope, we have built a model capable of predicting people's sentiment for multiple movies, which allows different movies to be compared on the basis of Twitter sentiment.

Page | 4
 Also, the model can predict people's sentiment from tweets supplied for different time frames (for example, tweets from 7 or 15 days pre and post release of the movie).

We analyse the data and classify the tweets into the following bins according to the score obtained for each tweet:

No. of stars    Rating       Score range

1               Poor         Score < 0
2               Average      0 <= Score < 0.5
3               Good         0.5 <= Score < 1
4               Very Good    1 <= Score < 1.5
5               Excellent    Score >= 1.5

Table 1
“Score” is the variable which we get from the sent.value parameter after passing the tweets.
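A minimal R sketch of this binning, assuming the scores come from syuzhet's get_sentiment() as in the appendix code; the data frame name tweets_scored and the example texts are illustrative:

library(syuzhet)

# Illustrative tweets; in the project the scores are computed for the cleaned tweet text
tweets_scored <- data.frame(
  text = c("#Tanhaji is excellent", "#Chhapaak fails miserably"),
  stringsAsFactors = FALSE
)
tweets_scored$score <- get_sentiment(tweets_scored$text)

# Bin the continuous score into the five rating levels of Table 1
# (right = FALSE gives intervals [-Inf, 0), [0, 0.5), [0.5, 1), [1, 1.5), [1.5, Inf))
tweets_scored$rating <- cut(tweets_scored$score,
                            breaks = c(-Inf, 0, 0.5, 1, 1.5, Inf),
                            labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
                            right  = FALSE)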

3. Data Source and Description


a) Data Source

Twitter API

Twitter is an innovative microblogging service launched in 2006, currently with more than 550 million users. User-created status messages are termed tweets by this service. The public timeline of the Twitter service displays tweets of all users worldwide and is an extensive source of real-time information.
The original concept behind microblogging was to provide personal status updates, but today tweets cover practically everything, ranging from current political affairs to personal experiences.
Movie reviews, travel experiences, current events etc. add to the list. Tweets (and microblogs in
general) are different from reviews in their basic structure. While reviews are characterized by
formal text patterns and are summarized thoughts of authors, tweets are more casual and restricted to
140 characters of text. Tweets offer companies an additional avenue to gather feedback.
Sentiment analysis of products, movie reviews and similar topics aids customers in decision-making before making a purchase or planning for a movie. Enterprises find this area useful for researching public opinion of their company and products, or for analysing customer satisfaction.
Organizations use this information to gather feedback about newly released products, which helps improve further design. Because the Twitter dataset is easily available and has wide reach, we are using only Twitter data as our main dataset.

Scraping of data from Twitter:
Twitter data can be accessed through the public API provided by Twitter. These APIs can be accessed only through authenticated requests, which must be signed with valid credentials. Twitter provides the authentication keys with which tweet extraction can be done. A few steps need to be followed to create the authentication keys:

1. Create an application on Twitter.

2. Manage the application.

3. Change the permissions to read and write.

4. Retrieve the authentication keys.

After finishing the entire process, we get the unique keys required for the collection of tweets from Twitter. The unique keys are:

• Consumer key

• Consumer Secret key

• Access token

• Access token secret


The tweets collected from Twitter carry information such as the tweet ID, user ID, date of the tweet, retweet counts, etc. For our analysis we use only the tweet text, the tweet date and the tweet ID. We connect the API to our app so that we can collect all the tweets related to our selected movie, along with the comments, controversies and news about that particular movie.

Figure 1: Twitter data from different Sources
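A minimal sketch of the collection step described above, using the rtweet package loaded in the appendix; the application name, key strings and hashtag are placeholders, and create_token()/search_tweets() reflect the rtweet interface available at the time of this project:

library(rtweet)

# Authenticate with the keys generated for the Twitter application (placeholders)
token <- create_token(
  app             = "movie_sentiment_app",
  consumer_key    = "CONSUMER_KEY",
  consumer_secret = "CONSUMER_SECRET",
  access_token    = "ACCESS_TOKEN",
  access_secret   = "ACCESS_SECRET"
)

# Pull English-language tweets for the movie hashtag, including retweets
tanhaji_tweets <- search_tweets("#Tanhaji", n = 18000, lang = "en",
                                include_rts = TRUE, token = token)

# Keep only the fields used in the analysis: the tweet text, its date and its ID
tweets_subset <- tanhaji_tweets[, c("status_id", "created_at", "text")]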

b) Data Description
We will be using data from 7 and 15 days pre and post movie release. A sample dataset for the movie "Tanhaji" is included in the appendix. We use the text column of the dataset for our sentiment analysis; other columns such as source, retweet_count, hashtags and created_at are used to get an overview of the dataset.

Algorithm for Sentiment Analysis:

 The first phase was data acquisition, where we chose Twitter as our data source.
 The second phase was data cleaning. After scraping the data, we cleaned it, mainly removing records with unavailable features.
 The third phase is data integration and transformation, where we classified some features and created a corpus of the text data.
 The fourth phase is sentiment analysis of the tweets. We use the get_sentiment function to obtain a score for each tweet and then build the DTM of the tweets, which also carries a column with the score value.
 In the fifth phase, the dataset is divided into two sections: the training dataset contains 70% and the testing dataset 30% of the total data.
 The sixth phase is result and analysis, where we run different classification models on the dataset and check the accuracy and AUC value; in general, the higher the accuracy, the better the result. A condensed R sketch of the scoring and splitting phases is shown below.
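A condensed R sketch of the fourth and fifth phases, assuming clean_text (the pre-processed tweets) and dtm (their document-term matrix) are already built as described in the next section:

library(syuzhet)

# Phase 4: score every cleaned tweet and attach the score to the DTM features
score      <- get_sentiment(clean_text)
model_data <- data.frame(as.matrix(dtm), score = score)

# Phase 5: 70/30 train-test split
set.seed(123)
train_idx <- sample(nrow(model_data), size = floor(0.7 * nrow(model_data)))
train <- model_data[train_idx, ]
test  <- model_data[-train_idx, ]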

Process Flow Diagram

Start → Data Scraping & Collection using Twitter API (Twitter data) → Data Processing & Feature Engineering → Data Visualization → Splitting the dataset into train and test datasets → Model Generation using a Machine Learning Algorithm → Predict the Attitude of the Reviewer → Result & Analysis → End

Figure 2: Process Flow diagram

4. Data Pre-processing
a) Data Cleaning
Tweets containing both positive and negative emoticons were not taken into account. The list of positive emoticons used for labeling the training set includes :), :-), : ), :D and =), while the list of negative emoticons consists of :(, :-( and : (. Inevitably this simplification results in partially correct or noisy labeling. The emoticons were stripped out of the training data so that the classifier learns from the other features that describe the tweets.
The tweets were manually labeled based on their sentiment, regardless of the presence of emoticons
in the tweets. As the Twitter community has created its own language to post messages, we explore
the unique properties of this language to better define the feature space.

The following tweet preprocessing options were tested:

 Removal of HTML tags and the “@” symbol; removing the hyperlinks often present in tweets also keeps the vocabulary size in check.
 Removal of tweet URLs.
 Removal of stray Unicode such as <U+A>.
 Removal of punctuation marks. The basic approach is to remove everything that is not a standard number or letter. It should be borne in mind that punctuation can sometimes be useful, as in web addresses, where it is part of the address itself; the removal of punctuation should therefore be tailored to the specific problem. In our case, we remove all punctuation.
 Removal of unhelpful terms. Many words are used frequently but are only meaningful within a sentence; these are called stop words, for example ‘the’, ‘is’, ‘at’ and ‘which’. It is unlikely that these words improve our ability to understand sentiment, so we remove them to reduce the size of the data.
 Conversion of all words to lowercase, so that the same word is not counted as different because of case.
 Removal of numbers or digits
 Removal of blank spaces both from the beginning and the end of the tweet.

In addition to Twitter-specific text preprocessing, other standard preprocessing steps were performed
to define the feature space for tweet feature vector construction. These include text tokenization,
removal of stopwords, stemming, N-gram construction (concatenating 1 to N stemmed words
appearing consecutively) and using minimum word frequency for feature space reduction.
The resulting terms were used as features in the construction of TF-IDF feature vectors representing the documents (tweets). TF-IDF stands for the term frequency-inverse document frequency weighting scheme, in which the weight reflects how important a word is to a document within a document collection.
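A minimal sketch of these cleaning steps in R, assuming the raw tweets are in a character vector tweet_text and using the tm package loaded in the appendix; the regular expressions are illustrative, not the project's exact patterns:

library(tm)

clean_text <- tweet_text
clean_text <- gsub("http\\S+|www\\.\\S+", "", clean_text)     # remove URLs / hyperlinks
clean_text <- gsub("<U\\+[0-9A-Fa-f]+>", "", clean_text)      # remove stray Unicode tags like <U+A>
clean_text <- gsub("@\\w+", "", clean_text)                   # remove the "@" mentions
clean_text <- tolower(clean_text)                             # convert to lowercase
clean_text <- removePunctuation(clean_text)                   # remove punctuation marks
clean_text <- removeNumbers(clean_text)                       # remove numbers and digits
clean_text <- removeWords(clean_text, stopwords("english"))   # remove stop words
clean_text <- stripWhitespace(trimws(clean_text))             # trim and collapse blank spaces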

b) Creation of Word Corpuses
A positive word corpus contains all possible positive words commonly used in tweets; a negative word corpus is created in the same way.
A word corpus contains a large number of words, since tweets are written by people around the world in their own style.
So we had to consider all possible words for each corpus, especially since the analysis covers Hollywood and Bollywood movies.

Tweet: #BoxOffice Report Day 3 Early Estimates: #DeepikaPadukone's #Chhapaak fails miserably, #AjayDevgn's #Tanhaji excellent
Positive words: “excellent”
Negative words: “fails”, “miserably”

Table 2: Example showing tweet and polarity words

When this tweet is brought in for analysis, the positive word corpus compares all the tokens, finds the word “excellent” and assigns a polarity count. The tweet then passes through the negative word corpus, which finds the words “fails” and “miserably” and assigns a polarity count. All other words are ignored by the word corpuses since they have neutral polarity.

c) Extraction and Tokenization


A number of tweets are collected for testing and stored in a file. They are extracted one by one from the file using an R program. A sentence in the file is treated as one tweet, using punctuation marks as the boundary: whenever a punctuation mark appears between two words, everything up to the word before the punctuation mark is taken as one sentence. Likewise, all the sentences are extracted.
Each word in a sentence is then extracted using the spaces between words as the boundary: whenever a space appears between letters, the letters before the blank space are treated as one word. Each extracted word is called a “token”, and the process of making tokens from sentences is called tokenization.
Extraction and tokenization are performed one sentence at a time: a sentence is extracted, tokenized and passed through the further steps, and only after all the steps are completed for that sentence is the next sentence extracted and processed.

d) DTM (Document Term Matrix)
A document-term matrix or term-document matrix is a mathematical matrix that describes the
frequency of terms that occur in a collection of documents. In a document-term matrix, rows
correspond to documents in the collection and columns correspond to terms. There are various
schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf.
A document-term matrix tracks the frequency of each term in each document. It starts with the bag-of-words representation of the documents, and for each document we track the number of times a term occurs. The raw term count is a common metric to use in a document-term matrix.
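A short sketch of building the matrix with tm, assuming the cleaned text from the previous section; the TF-IDF weighting shown is one of the schemes mentioned above:

library(tm)

corpus <- VCorpus(VectorSource(clean_text))

# Raw term counts (bag-of-words)
dtm_counts <- DocumentTermMatrix(corpus)

# TF-IDF weighted matrix, keeping only terms of length 3 or more
dtm_tfidf <- DocumentTermMatrix(corpus,
                                control = list(weighting = weightTfIdf,
                                               wordLengths = c(3, Inf)))

# Drop very sparse terms to keep the feature space manageable
dtm_small <- removeSparseTerms(dtm_counts, sparse = 0.99)
inspect(dtm_small[1:5, 1:5])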

5. Exploratory Data Analysis


Sentiment analysis is broadly classified into two types: feature- or aspect-based sentiment analysis and objectivity-based sentiment analysis. Tweets related to movie reviews come under the category of feature-based sentiment analysis.
Objectivity-based sentiment analysis explores the tweets related to emotions such as hate, miss, love, etc. In statistics, exploratory data analysis (EDA) is an approach to analysing data sets to summarize their main characteristics, often with visual methods.
A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis-testing task. EDA is precisely that, an approach rather than a fixed set of techniques: an attitude about how a data analysis should be carried out.
Different approaches, including machine learning (ML) techniques, sentiment lexicons and hybrid approaches, have proved useful for sentiment analysis on formal texts, but their effectiveness for extracting sentiment from microblogging data still has to be explored.
A careful investigation of tweets reveals that the 140-character limit restricts the vocabulary that conveys the sentiment, and the varied domains discussed impose hurdles for training. The frequency of misspellings and slang words in tweets (and microblogs in general) is much higher than in other language resources, which is another hurdle that needs to be overcome.
On the other hand, the tremendous volume of data available from microblogging websites across varied domains is unmatched by other data resources. Microblogging language is characterized by expressive punctuation which conveys a lot of sentiment; bold-lettered phrases, exclamations, question marks, quoted text, etc. leave scope for sentiment extraction.
The proposed work attempts a novel approach on Twitter data by aggregating an adapted polarity lexicon, learnt from reviews in the domains under consideration, with tweet-specific features and unigrams to build a classifier model using machine learning techniques.
Here are a few EDA techniques that describe the microblogging language clearly and can build a strong picture of the sentiment around any movie. These techniques also give a few important signals about how the movie is trending in the market.

NOTE: Here, all the outputs have been considered for the movie “Tanhaji”

a) Visualization of retweets
Here we present the nature of the tweets we collected. The donut chart below shows the proportion of each type of tweet.

Figure 3: Donut chart for type of tweets

b) Plot of most frequent words in the text


The plot below shows the words most frequently used in the collected tweets.

Figure 4: Bar plot for most occurring movie tweets

 N-gram plots: Bag of Words ignores the semantic context of the review and concentrates primarily on the frequency of each word. To overcome that, we also tried n-gram modelling, wherein we created unigrams, bigrams and a mixture of both. While creating unigrams is more or less similar to the bag-of-words approach, bigrams provided more contextual information on the review text. A sketch of the bigram construction is given after the tri-gram plot below.

Plot of the frequency distribution of uni-grams: these models assign probabilities to sequences of single words.

Figure 5: Uni-gram distribution

Plot of the frequency distribution of bi-grams: these models assign probabilities to sequences of two words.

Figure 6: Bi-gram distribution

Plot of the frequency distribution of tri-grams: these models assign probabilities to sequences of three words.

Figure 7: Tri-gram distribution
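A sketch of the bigram construction using the RWeka package loaded in the appendix, assuming the corpus built in section 4; the same pattern with min/max of 1 or 3 gives the unigram and trigram counts:

library(tm)
library(RWeka)

# Tokenizer that produces two-word sequences
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

dtm_bigram <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))

# Most frequent bigrams, the quantities plotted in Figure 6
bigram_freq <- sort(colSums(as.matrix(dtm_bigram)), decreasing = TRUE)
head(bigram_freq, 10)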

c) Word Cloud of the keywords of the tweets

A word cloud is a text mining method for finding the most frequently used words in a text. Here is the word cloud of the most frequent words in the tweets; a sketch of generating such a cloud follows Figure 8.

Figure 8: Word cloud for frequent words
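A brief sketch of producing such a cloud with the wordcloud package loaded in the appendix, using the term frequencies from the document-term matrix built earlier:

library(wordcloud)
library(RColorBrewer)

# Frequency of each term across all tweets
word_freq <- sort(colSums(as.matrix(dtm_counts)), decreasing = TRUE)

set.seed(123)
wordcloud(words = names(word_freq), freq = word_freq,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))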

d) Word Cloud for Account from which most retweets originate
This word cloud shows the accounts from which most of the retweets have been generated.

Figure 9: Word cloud for retweets accounts

e) Word Cloud for the Location from where most of the tweets originate 

Figure 10: Word cloud for tweets location

6. Modelling Approach
a. Techniques and software to be used

 Excel
 R/Python:

Sentiment prediction has been a major area of research in recent times and is a challenging task, especially in morphologically rich languages. The task requires us to classify a given sentence as either "Positive" or "Negative". In order to do this, we went ahead and tried out multiple deep learning-based methods.
In this project we apply data mining techniques and machine learning algorithms, using several feature extraction techniques from text mining, and assess their relevance to our problem.
We develop a methodology on the basis of the historical and current data available from the data source, i.e. Twitter, to understand any movie's viewer sentiment.
We could predict the rating of a given movie based only on its summary, but quite often we also have reviews for the movie along with the summary, which can be used to improve the prediction capability of our models. Using the sentiment of these reviews as a prior along with the summary aids the task at hand.
In order to generate sentiment priors, we used the sentiment classification described in the previous section. The actual task of predicting the rating is a regression problem whose output would be a single floating-point value between 0.0 and 5.0; to simplify the task, we convert it into a classification problem by rounding the true rating to the nearest integer, which gives us a five-class classification problem.
Following are the key variables used during our analysis
 Text (tweets)
 Re-tweet count
 Screen name
 Date of tweet/ re-tweet
 Favorite_count
 hashtags
 retweet_favorite_count
 verified

From the above list of variables, Text (tweets) and hashtags are the most important variables used for the analysis; however, the other fields also contribute to the final insight from the textual data.
Sentiment analysis refers to the use of natural language processing, text analysis and computational
linguistics to extract and identify subjective information in source materials.
We provide both pre-release and post-release tweets for the EDA and for generating polarity from the tweets. Here we use get_nrc_sentiment(tweets) to find the sentiment of all the tweets passed.
The tidytext get_sentiments function returns a tibble, which can be used to take a look at which words are counted as "positive" and "negative" sentiment.
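A short sketch of this scoring step with syuzhet, assuming the cleaned tweet text; the column sums are the counts behind a bar chart like Figure 11:

library(syuzhet)

# One row per tweet: eight NRC emotions plus positive/negative counts
nrc_scores <- get_nrc_sentiment(clean_text)

# Totals per emotion across all tweets
emotion_totals <- colSums(nrc_scores)

barplot(sort(emotion_totals, decreasing = TRUE),
        las = 2, main = "NRC sentiment of the collected tweets")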

Figure 11: Sentiment graph from tweets generated

b. Model Building
We passed both the pre-release and post-release datasets through the models to meet our objective of finding people's sentiment and comparing it before and after the movie release.
For this we have built a few classification models for predicting the different rating levels:

 CART Model

CART (Classification and Regression Trees) is a machine learning algorithm for classification and regression. The CART algorithm works by recursively partitioning the training dataset to obtain purer target classes, where every node in the tree corresponds to a specific set of records split by a test on a selected feature. A minimal sketch of fitting such a tree is given below.
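A minimal sketch using the rpart and caret packages as an assumed implementation (the report does not name the exact R functions); rating is the five-level factor from Table 1 and train/test come from the 70/30 split:

library(rpart)
library(rpart.plot)
library(caret)

# Grow a classification tree on the training DTM features
cart_model <- rpart(rating ~ ., data = train, method = "class",
                    control = rpart.control(cp = 0.001, minsplit = 20))

rpart.plot(cart_model)   # tree diagram, as in Figure 12

# Confusion matrix and accuracy on the held-out 30%
cart_pred <- predict(cart_model, newdata = test, type = "class")
confusionMatrix(cart_pred, test$rating)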

Here is the output of the CART tree, which represents the different sentiments of people for the movie "Tanhaji".

Figure 12: Decision Tree

The performance of each model is evaluated with the confusion matrix, a specific table layout that allows visualization of the performance of an algorithm.
The confusion matrix of the CART model on the pre-release test dataset gives an accuracy of 71.86%.

Figure 13: CART Confusion Matrix for pre-release matrix

The confusion matrix of the CART model on the post-release tweets gives an accuracy of 55.2%.

Figure 14: CART Confusion Matrix for post-release matrix

 Random Forest Model

The Random Forest classifier introduces two types of randomness, the first with respect to the data and the second with respect to the features. It uses the concepts of bagging and bootstrapping.

As a Random Forest is a combination of decision trees, it has several hyperparameters:
 Number of trees to construct for the forest
 Number of features to select at random at each split
 Depth of each tree

All these hyperparameters have to be set manually, which is time consuming and does not guarantee good results for the values chosen. Each hyperparameter has its own importance and influence on the output prediction. Two measures of importance are given for each variable in the random forest: the first is based on how much the accuracy decreases when the variable is excluded, and the second on the decrease in Gini impurity when the variable is chosen to split a node. A sketch of fitting such a forest is given below.
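A sketch with the randomForest package (again an assumption about the specific implementation), setting the hyperparameters listed above and requesting both importance measures:

library(randomForest)

set.seed(123)
rf_model <- randomForest(rating ~ ., data = train,
                         ntree = 500,        # number of trees in the forest
                         mtry  = 30,         # features sampled at each split (illustrative value)
                         nodesize = 5,       # indirectly limits the depth of each tree
                         importance = TRUE)  # compute both importance measures

# Mean decrease in accuracy and mean decrease in Gini, as in Figure 15
varImpPlot(rf_model)

rf_pred <- predict(rf_model, newdata = test)
caret::confusionMatrix(rf_pred, test$rating)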

Figure 15: Variable Importance Plot for pre-release tweets

The confusion matrix of the Random Forest model on the pre-release tweets gives an accuracy of 93.26%.

Figure 16: Random Forest Confusion Matrix for pre-release tweets

The confusion matrix of the Random Forest model on the post-release tweets gives an accuracy of 91.66%.

Figure 17: Random Forest Confusion Matrix for post-release tweets

 Naïve Bayes Classifier


In this model, all the tweets and labels are first passed to the classifier. In the next step, feature extraction is done, and the extracted features and tweets are passed to the Naïve Bayes classifier, which is then trained on this training data. The classifier dump file is then opened in write-back mode, the feature words are stored in it along with the classifier, and the file is closed. A minimal sketch is given below.
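A minimal sketch with the naiveBayes function from e1071 (an assumed implementation; the report describes the classifier generically):

library(e1071)

# Train on the labelled features; Laplace smoothing guards against unseen feature words
nb_model <- naiveBayes(rating ~ ., data = train, laplace = 1)

nb_pred <- predict(nb_model, newdata = test)
caret::confusionMatrix(nb_pred, test$rating)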
The confusion matrix of the Naïve Bayes model on the pre-release tweets gives an accuracy of 48.2%.

Figure 18: Naïve Bayes Model Confusion Matrix for pre-release tweets

The confusion matrix of the Naïve Bayes model on the post-release tweets gives an accuracy of 30.4%.

Figure 19: Naïve Bayes Model Confusion Matrix for post-release tweets

 Support Vector Machines


For SVM we use three labels: 0, 1 and 2, where 0 represents positive, 1 negative and 2 neutral. Each word in a tweet is represented as either 0 or 1: if it is a feature word it is represented as 1, otherwise 0, so we get a sequence of 0s and 1s. This feature vector and the class labels are given to an SVM classifier to classify tweets as positive, negative or neutral. A minimal sketch is given below.
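A sketch with the svm function from e1071 (an assumed implementation), where label is a factor with the three levels described above and train_svm/test_svm hold the 0/1 feature-word indicators:

library(e1071)

# Linear-kernel SVM on the binary feature vectors
svm_model <- svm(label ~ ., data = train_svm, kernel = "linear", cost = 1)

svm_pred <- predict(svm_model, newdata = test_svm)
caret::confusionMatrix(svm_pred, test_svm$label)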

The confusion matrix of the SVM model on the pre-release tweets gives an accuracy of 73.4%.

Figure 20: SVM Model Confusion Matrix for pre-release tweets

The confusion matrix of the SVM model on the post-release tweets gives an accuracy of 72.25%.

Figure 21: SVM Model Confusion Matrix for post-release tweets

c. Comparison chart of accuracy & AUC values of all models for the pre-release dataset

Table 3

d. Comparison chart of accuracy & AUC values of all models for the post-release dataset

Table 4

Note: This shows that the Random Forest model performs best.

7. Recommendations & Applications


 This model helps us to find out the sentiments of people towards any movie.
 Another application of this project would be to find a group of viewers with similar movie tastes
(likes or dislikes).
 The sentiment of reviews is a valuable source that would lead to more accurate rating predictions and would help people know how a movie is faring at the box office from a public point of view.
 Sentiment analysis can help to immediately identify situations such as the reaction of people to a teaser or trailer, and can give necessary insights to the PR & management teams.

8. Challenges and Limitations


 Various hashtags point to the topic, and it is hard to cover all the hashtags that would add value to our sentiment analysis task.
 Misleading tweets may influence our analysis.
 Locations of all the users are not available in the data.
 The model is based only on Twitter data.
 Sarcastic comments are another hurdle to overcome, as it is difficult to differentiate whether they are meant in a positive or a negative sense.
 As the project is based on Twitter data, scraping the data at regular intervals is important, since Twitter only provides data from the last 7 to 9 days for any string or hashtag.
 This model can also be used for old movies, on the condition that data has been scraped from Twitter for the particular movie whose analysis is required.

 Another task in sentiment analysis is subjectivity/objectivity identification, which focuses on classifying a given text (usually a sentence) into one of two classes, objective or subjective. As the subjectivity of words and phrases may depend on their context, and an objective document may contain subjective sentences (e.g. a news article quoting people's opinions), this problem can sometimes be more difficult than polarity classification.

9. References and Bibliography

 A study on feature selection & classification algorithms–


https://ieeexplore.ieee.org/document/7522583
 Deep learning for sentiment analysis–https://cs224d.stanford.edu/reports/PouransariHadi.pdf
 Movie reviews using Logistic Regression–https://itnext.io/machine-learning-sentiment-analysis-
of-movie-reviews-using-logisticregression-62e9622b4532
 Natural Language Processing SoSe 2016 –
https://hpi.de/fileadmin/user_upload/fachgebiete/plattner/teaching/NaturalLanguageProcessing/
NLP2016/NLP09_SentimentAnalysis.pdf
 Opinion Mining on Twitter Data of Movie Reviews using R–
https://pdfs.semanticscholar.org/aad9/3c7978ddc2378e781f62ea12fece439f6d7f.pdf
 Sentiment Analysis – Wikipedia – https://en.wikipedia.org/wiki/Sentiment_analysis
 Sentiment Analysis of Movie Review Using Text Mining–https://acadpubl.eu/hub/2018-119-
16/2/374.pdf
 Sentiment Analysis of Movie Reviews using Machine Learning Techniques–
https://www.researchgate.net/publication/321843804_Sentiment_Analysis_of_Movie_Reviews_
using_Machine_Learning_Techniques
 Sentiment Analysis of Movie Reviews–https://machinelearningmastery.com/prepare-movie-
review-data-sentiment-analysis/
 Tidy text mining - https://www.tidytextmining.com/tidytext.html

10. Appendix

Below are the attached files for the sample dataset and the data definition for the movie "Tanhaji".

tanhaji_all sample data_definition.csv


dataset.xlsx

The following shows the R code and its outputs.


R Notebook
#install.packages("twitteR")
#install.packages("RCurl")
#install.packages("httr")
#install.packages("syuzhet")
#install.packages("rtweet")
#install.packages("forestmangr")
#install.packages("tidytext")
#install.packages("slam")
library(twitteR)
library(rtweet)
##
## Attaching package: 'rtweet'
## The following object is masked from 'package:twitteR':
##
## lookup_statuses
library(RCurl)
library(httr)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:httr':
##
## content
library(wordcloud)
## Loading required package: RColorBrewer
library(syuzhet)
##
## Attaching package: 'syuzhet'
## The following object is masked from 'package:rtweet':
##
## get_tokens
library(dplyr)
##
## Attaching package: 'dplyr'

Page | 24
## The following objects are masked from 'package:twitteR':
##
## id, location
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forestmangr)
library(tidytext)
library(slam)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RWeka)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
#setwd("E:\Study\Capstone")

#tweet1 <- read.csv("tanhaji_set3.csv",stringsAsFactors = FALSE)


#tweet2 <- read.csv("tanhaji_set4.csv",stringsAsFactors = FALSE)

#tanhaji_all <- rbind(tweet1,tweet2)

#write.csv(tanhaji_all,"tanhaji_all.csv")
setwd("E:\\Study\\Capstone")

movie_tweets_data <- read.csv("tanhaji_all.csv", stringsAsFactors = FALSE)

Sort the dataset in ascending order by date

movie_tweets_data <- movie_tweets_data %>%


arrange(created_at)

head(movie_tweets_data)
## X.1 X user_id status_id created_at screen_name
## 1 66439 66439 2.340511e+09 1.212471e+18 2020-01-01 20:31:06 bjadams156
## 2 66440 66440 2.340511e+09 1.212471e+18 2020-01-01 20:31:06 bjadams156
## 3 60459 60459 4.021178e+08 1.212553e+18 2020-01-02 01:54:20 A_Jay_FanNepal
## 4 56533 56533 1.148293e+18 1.212556e+18 2020-01-02 02:07:49 MizarPradyum
## 5 55158 55158 1.126794e+18 1.212557e+18 2020-01-02 02:12:43 ABHI_ADholic04
## 6 56467 56467 1.148293e+18 1.212558e+18 2020-01-02 02:13:47 MizarPradyum
##
text

## 1 Asking this only to @ajaydevgn fans ..in KRK style !!\nHow many of you feel that
#RohitShetty is making full chuti*a of Devgn on the name of <U+0001F449> mere bade Hain
bhaiya re !!\n#TanhajiTheUnsungWarrior \n#Tanhaji #BhujThePrideOfIndia\nAjay fans ke
alawa koi aur yahan text Kiya to UMKB <U+0001F923>
## 2 Asking this only to @ajaydevgn fans ..in KRK style !!\nHow many of you feel that
#RohitShetty is making full chuti*a of Devgn on the name of <U+0001F449> mere bade Hain
bhaiya re !!\n#TanhajiTheUnsungWarrior \n#Tanhaji #BhujThePrideOfIndia\nAjay fans ke
alawa koi aur yahan text Kiya to UMKB <U+0001F923>
## 3
@AbhishekDudhai6 @NishantADHolic_ @ajaydevgn After #Tanhaji cannot wait to watch this
## 4
Dhokebaaj of the decade<U+0001F449> @deepikapadukone ... She betrayed the person who has
given her back to back 4 hits(LAK , Race2, Cocktail, Aarakshan) by clashing her movie
#Chhapaak with SAIF Sir's #Tanhaji
## 5
@krish_is_Devil @MizarPradyum Thanks bhai , watch it in 3D for better experience
#TanhajiTheUnsungWarrior #Tanhaji
## 6
40 k completed on BMS #TanhajiTheUnsungWarrior #Tanhaji
## source display_text_width reply_to_status_id reply_to_user_id
## 1 Twitter for iPhone 140 NA NA
## 2 Twitter for iPhone 140 NA NA
## 3 Twitter for Android 104 NA NA
## 4 Twitter for Android 140 NA NA
## 5 Twitter for Android 84 1.212438e+18 1.207668e+18
## 6 Twitter for Android 76 NA NA
## reply_to_screen_name is_quote is_retweet favorite_count retweet_count
## 1 <NA> FALSE TRUE 0 58
## 2 <NA> FALSE TRUE 0 58
## 3 <NA> FALSE TRUE 0 4
## 4 <NA> FALSE TRUE 0 32
## 5 krish_is_Devil FALSE FALSE 4 0
## 6 <NA> FALSE TRUE 0 2
## quote_count reply_count hashtags symbols
## 1 NA NA RohitShetty NA
## 2 NA NA RohitShetty NA
## 3 NA NA Tanhaji NA
## 4 NA NA <NA> NA
## 5 NA NA c("TanhajiTheUnsungWarrior", "Tanhaji") NA
## 6 NA NA c("TanhajiTheUnsungWarrior", "Tanhaji") NA
## urls_url urls_t.co urls_expanded_url media_url media_t.co media_expanded_url
## 1 <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA>
## media_type ext_media_url ext_media_t.co ext_media_expanded_url ext_media_type
## 1 <NA> <NA> <NA> <NA> NA
## 2 <NA> <NA> <NA> <NA> NA
## 3 <NA> <NA> <NA> <NA> NA
## 4 <NA> <NA> <NA> <NA> NA
## 5 <NA> <NA> <NA> <NA> NA
## 6 <NA> <NA> <NA> <NA> NA
## mentions_user_id
## 1 c("368435117", "65659343")
## 2 c("368435117", "65659343")
## 3 c("110915886", "1109687428778999808", "2955267019", "65659343")
## 4 c("3853108342", "101695592")
## 5 c("1207667977618890752", "1148293149174820864")

## 6 1126794046528077824
## mentions_screen_name lang
## 1 c("teamrb_", "ajaydevgn") en
## 2 c("teamrb_", "ajaydevgn") en
## 3 c("racquel_erika", "AbhishekDudhai6", "NishantADHolic_", "ajaydevgn") en
## 4 c("ClassySaifian", "deepikapadukone") en
## 5 c("krish_is_Devil", "MizarPradyum") en
## 6 ABHI_ADholic04 en
## quoted_status_id quoted_text quoted_created_at quoted_source
## 1 NA <NA> <NA> <NA>
## 2 NA <NA> <NA> <NA>
## 3 NA <NA> <NA> <NA>
## 4 NA <NA> <NA> <NA>
## 5 NA <NA> <NA> <NA>
## 6 NA <NA> <NA> <NA>
## quoted_favorite_count quoted_retweet_count quoted_user_id quoted_screen_name
## 1 NA NA NA <NA>
## 2 NA NA NA <NA>
## 3 NA NA NA <NA>
## 4 NA NA NA <NA>
## 5 NA NA NA <NA>
## 6 NA NA NA <NA>
## quoted_name quoted_followers_count quoted_friends_count quoted_statuses_count
## 1 <NA> NA NA NA
## 2 <NA> NA NA NA
## 3 <NA> NA NA NA
## 4 <NA> NA NA NA
## 5 <NA> NA NA NA
## 6 <NA> NA NA NA
## quoted_location quoted_description quoted_verified retweet_status_id
## 1 <NA> <NA> NA 1.212400e+18
## 2 <NA> <NA> NA 1.212400e+18
## 3 <NA> <NA> NA 1.212353e+18
## 4 <NA> <NA> NA 1.212400e+18
## 5 <NA> <NA> NA NA
## 6 <NA> <NA> NA 1.212430e+18
##
retweet_text
## 1 Asking this only to @ajaydevgn fans ..in KRK style !!\nHow many of you feel that
#RohitShetty is making full chuti*a of Devgn on the name of <U+0001F449> mere bade Hain
bhaiya re !!\n#TanhajiTheUnsungWarrior \n#Tanhaji #BhujThePrideOfIndia\nAjay fans ke
alawa koi aur yahan text Kiya to UMKB <U+0001F923>
## 2 Asking this only to @ajaydevgn fans ..in KRK style !!\nHow many of you feel that
#RohitShetty is making full chuti*a of Devgn on the name of <U+0001F449> mere bade Hain
bhaiya re !!\n#TanhajiTheUnsungWarrior \n#Tanhaji #BhujThePrideOfIndia\nAjay fans ke
alawa koi aur yahan text Kiya to UMKB <U+0001F923>
## 3
@AbhishekDudhai6 @NishantADHolic_ @ajaydevgn After #Tanhaji cannot wait to watch this
## 4
Dhokebaaj of the decade<U+0001F449> @deepikapadukone ... She betrayed the person who has
given her back to back 4 hits(LAK , Race2, Cocktail, Aarakshan) by clashing her movie
#Chhapaak with SAIF Sir's #Tanhaji
## 5
<NA>
## 6
40 k completed on BMS #TanhajiTheUnsungWarrior #Tanhaji
## retweet_created_at retweet_source retweet_favorite_count
## 1 2020-01-01 15:48:21 Twitter for Android 357
## 2 2020-01-01 15:48:21 Twitter for Android 357
## 3 2020-01-01 12:42:00 Twitter for Android 6

## 4 2020-01-01 15:48:40 Twitter for Android 51
## 5 <NA> <NA> NA
## 6 2020-01-01 17:45:27 Twitter for Android 11
## retweet_retweet_count retweet_user_id retweet_screen_name
## 1 58 3.684351e+08 teamrb_
## 2 58 3.684351e+08 teamrb_
## 3 4 1.109159e+08 racquel_erika
## 4 32 3.853108e+09 ClassySaifian
## 5 NA NA <NA>
## 6 2 1.126794e+18 ABHI_ADholic04
## retweet_name retweet_followers_count
## 1 REAL BOXOFFICE 16590
## 2 REAL BOXOFFICE 16590
## 3 forever aamirian 831
## 4 <U+2614><U+FE0F>CLASSY SAIFIAN<U+2614> 1770
## 5 <NA> NA
## 6 ABHI_Tanhaji04 570
## retweet_friends_count retweet_statuses_count retweet_location
## 1 1 5215 New Delhi, India
## 2 1 5215 New Delhi, India
## 3 72 11616 India
## 4 32 21547
## 5 NA NA <NA>
## 6 603 13246
##
retweet_description
## 1
Typing the truth below ..<U+0001F447><U+0001F64F>
## 2
Typing the truth below ..<U+0001F447><U+0001F64F>
## 3 only aamir khan rock my
world...i love u aamir khan for life..i know one day i will see aamir khan up close &
personal
## 4 ur Fav Actor mi8 b a bigger<U+2B50><U+FE0F>but definitely not a better INSAAN than
SAIF SIR..Sir is epitome of Class,Royalness& humbleness.(Fan Account of Megastar SAIF
Sir).
## 5
<NA>
## 6
Tanhaji on 10 jan , watch it in 3d
## retweet_verified place_url place_name place_full_name place_type country
## 1 FALSE <NA> <NA> <NA> <NA> <NA>
## 2 FALSE <NA> <NA> <NA> <NA> <NA>
## 3 FALSE <NA> <NA> <NA> <NA> <NA>
## 4 FALSE <NA> <NA> <NA> <NA> <NA>
## 5 NA <NA> <NA> <NA> <NA> <NA>
## 6 FALSE <NA> <NA> <NA> <NA> <NA>
## country_code geo_coords coords_coords bbox_coords
## 1 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 2 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 3 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 4 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 5 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 6 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## status_url name
## 1 https://twitter.com/bjadams156/status/1212471332127948801 Brandon Adams
## 2 https://twitter.com/bjadams156/status/1212471332127948801 Brandon Adams
## 3 https://twitter.com/A_Jay_FanNepal/status/1212552672995356672 ADFnepal
## 4 https://twitter.com/MizarPradyum/status/1212556070024953856 Ajay Devgn fan
## 5 https://twitter.com/ABHI_ADholic04/status/1212557301426589698 ABHI_Tanhaji04

## 6 https://twitter.com/MizarPradyum/status/1212557570491047936 Ajay Devgn fan
## location
## 1 Winston-Salem,NC
## 2 Winston-Salem,NC
## 3 kathmandu, Nepal
## 4 Indiana, USA
## 5
## 6 Indiana, USA
## description url
## 1 I have a twin sister.I graduated from Glenn High School in 2013. <NA>
## 2 I have a twin sister.I graduated from Glenn High School in 2013. <NA>
## 3 Movie Lovers - @ajayDevgn - #maidaan #Tanhaji #bhuj #RRR <NA>
## 4 welcome to ajay devgn kingdom <NA>
## 5 Tanhaji on 10 jan , watch it in 3d <NA>
## 6 welcome to ajay devgn kingdom <NA>
## protected followers_count friends_count listed_count statuses_count
## 1 FALSE 1171 5002 24 64508
## 2 FALSE 1171 5002 24 64508
## 3 FALSE 4957 554 28 73249
## 4 FALSE 180 461 0 23128
## 5 FALSE 570 603 1 13246
## 6 FALSE 180 461 0 23128
## favourites_count account_created_at verified profile_url
## 1 62563 2014-02-13 03:26:57 FALSE <NA>
## 2 62563 2014-02-13 03:26:57 FALSE <NA>
## 3 16149 2011-10-31 15:40:10 FALSE <NA>
## 4 30028 2019-07-08 18:09:56 FALSE <NA>
## 5 4616 2019-05-10 10:20:10 FALSE <NA>
## 6 30028 2019-07-08 18:09:56 FALSE <NA>
## profile_expanded_url account_lang
## 1 <NA> NA
## 2 <NA> NA
## 3 <NA> NA
## 4 <NA> NA
## 5 <NA> NA
## 6 <NA> NA
## profile_banner_url
## 1 <NA>
## 2 <NA>
## 3 https://pbs.twimg.com/profile_banners/402117759/1474271309
## 4 https://pbs.twimg.com/profile_banners/1148293149174820864/1575377431
## 5 https://pbs.twimg.com/profile_banners/1126794046528077824/1576308416
## 6 https://pbs.twimg.com/profile_banners/1148293149174820864/1575377431
## profile_background_url
## 1 http://abs.twimg.com/images/themes/theme1/bg.png
## 2 http://abs.twimg.com/images/themes/theme1/bg.png
## 3 http://abs.twimg.com/images/themes/theme1/bg.png
## 4 <NA>
## 5 <NA>
## 6 <NA>
## profile_image_url
## 1 http://pbs.twimg.com/profile_images/538421341054312451/1MyROBLw_normal.jpeg
## 2 http://pbs.twimg.com/profile_images/538421341054312451/1MyROBLw_normal.jpeg
## 3 http://pbs.twimg.com/profile_images/1111984479063531521/Vk48l5bv_normal.jpg
## 4 http://pbs.twimg.com/profile_images/1215181148839530496/NUdgSpAR_normal.jpg
## 5 http://pbs.twimg.com/profile_images/1208985936597409792/H6J4Ls-9_normal.jpg
## 6 http://pbs.twimg.com/profile_images/1215181148839530496/NUdgSpAR_normal.jpg
tail(movie_tweets_data)

## X.1 X user_id status_id created_at screen_name
## 174346 66582 142 1.214896e+18 1.216941e+18 2020-01-14 04:32:37 Chintan64138110
## 174347 66576 136 1.158960e+18 1.216941e+18 2020-01-14 04:32:41 Gajendr16824337
## 174348 66448 8 1.387810e+08 1.216941e+18 2020-01-14 04:32:43 onenonlymahakal
## 174349 66443 3 1.163468e+18 1.216941e+18 2020-01-14 04:32:44 shri0944
## 174350 66441 1 1.669707e+09 1.216941e+18 2020-01-14 04:32:45 pruthwirajdeo
## 174351 66442 2 7.057414e+17 1.216941e+18 2020-01-14 04:32:45 Aryaman111
##
text
## 174346 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174347 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174348 As per the reliable
sources #TanhajiTheUnsungWarrior will be declared national film of India by our home
minister @AmitShah in next parliament session.\n\nCongratulations @ajaydevgn @omraut
<U+0001F389><U+0001F389>#Tanhaji
## 174349 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174350 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174351 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## source display_text_width reply_to_status_id
## 174346 Twitter for Android 140 NA
## 174347 Twitter for Android 140 NA
## 174348 Twitter for Android 140 NA
## 174349 Twitter Web App 140 NA
## 174350 Twitter for Android 140 NA
## 174351 Twitter for Android 140 NA
## reply_to_user_id reply_to_screen_name is_quote is_retweet favorite_count
## 174346 NA <NA> FALSE TRUE 0
## 174347 NA <NA> FALSE TRUE 0
## 174348 NA <NA> FALSE TRUE 0
## 174349 NA <NA> FALSE TRUE 0
## 174350 NA <NA> FALSE TRUE 0
## 174351 NA <NA> FALSE TRUE 0
## retweet_count quote_count reply_count hashtags symbols
## 174346 25 NA NA Tanhaji NA
## 174347 25 NA NA Tanhaji NA
## 174348 14 NA NA TanhajiTheUnsungWarrior NA
## 174349 25 NA NA Tanhaji NA
## 174350 25 NA NA Tanhaji NA
## 174351 25 NA NA Tanhaji NA
## urls_url urls_t.co urls_expanded_url media_url media_t.co
## 174346 <NA> <NA> <NA> <NA> <NA>
## 174347 <NA> <NA> <NA> <NA> <NA>
## 174348 <NA> <NA> <NA> <NA> <NA>
## 174349 <NA> <NA> <NA> <NA> <NA>
## 174350 <NA> <NA> <NA> <NA> <NA>

## 174351 <NA> <NA> <NA> <NA> <NA>
## media_expanded_url media_type ext_media_url ext_media_t.co
## 174346 <NA> <NA> <NA> <NA>
## 174347 <NA> <NA> <NA> <NA>
## 174348 <NA> <NA> <NA> <NA>
## 174349 <NA> <NA> <NA> <NA>
## 174350 <NA> <NA> <NA> <NA>
## 174351 <NA> <NA> <NA> <NA>
## ext_media_expanded_url ext_media_type mentions_user_id
## 174346 <NA> NA 2924521080
## 174347 <NA> NA 2924521080
## 174348 <NA> NA 2754072768
## 174349 <NA> NA 2924521080
## 174350 <NA> NA 2924521080
## 174351 <NA> NA 2924521080
## mentions_screen_name lang quoted_status_id quoted_text quoted_created_at
## 174346 davidfrawleyved en NA <NA> <NA>
## 174347 davidfrawleyved en NA <NA> <NA>
## 174348 RoninADfannn en NA <NA> <NA>
## 174349 davidfrawleyved en NA <NA> <NA>
## 174350 davidfrawleyved en NA <NA> <NA>
## 174351 davidfrawleyved en NA <NA> <NA>
## quoted_source quoted_favorite_count quoted_retweet_count quoted_user_id
## 174346 <NA> NA NA NA
## 174347 <NA> NA NA NA
## 174348 <NA> NA NA NA
## 174349 <NA> NA NA NA
## 174350 <NA> NA NA NA
## 174351 <NA> NA NA NA
## quoted_screen_name quoted_name quoted_followers_count
## 174346 <NA> <NA> NA
## 174347 <NA> <NA> NA
## 174348 <NA> <NA> NA
## 174349 <NA> <NA> NA
## 174350 <NA> <NA> NA
## 174351 <NA> <NA> NA
## quoted_friends_count quoted_statuses_count quoted_location
## 174346 NA NA <NA>
## 174347 NA NA <NA>
## 174348 NA NA <NA>
## 174349 NA NA <NA>
## 174350 NA NA <NA>
## 174351 NA NA <NA>
## quoted_description quoted_verified retweet_status_id
## 174346 <NA> NA 1.216941e+18
## 174347 <NA> NA 1.216941e+18
## 174348 <NA> NA 1.216796e+18
## 174349 <NA> NA 1.216941e+18
## 174350 <NA> NA 1.216941e+18
## 174351 <NA> NA 1.216941e+18
##
retweet_text
## 174346 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174347 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.

## 174348 As per the reliable
sources #TanhajiTheUnsungWarrior will be declared national film of India by our home
minister @AmitShah in next parliament session.\n\nCongratulations @ajaydevgn @omraut
<U+0001F389><U+0001F389>#Tanhaji
## 174349 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174350 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174351 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## retweet_created_at retweet_source retweet_favorite_count
## 174346 2020-01-14 04:30:55 Twitter Web App 99
## 174347 2020-01-14 04:30:55 Twitter Web App 99
## 174348 2020-01-13 18:55:21 Twitter for Android 23
## 174349 2020-01-14 04:30:55 Twitter Web App 99
## 174350 2020-01-14 04:30:55 Twitter Web App 99
## 174351 2020-01-14 04:30:55 Twitter Web App 99
## retweet_retweet_count retweet_user_id retweet_screen_name
## 174346 25 2924521080 davidfrawleyved
## 174347 25 2924521080 davidfrawleyved
## 174348 14 2754072768 RoninADfannn
## 174349 25 2924521080 davidfrawleyved
## 174350 25 2924521080 davidfrawleyved
## 174351 25 2924521080 davidfrawleyved
## retweet_name
## 174346 Dr David Frawley
## 174347 Dr David Frawley
## 174348 THE WHITE WOLF..!! #Tanhaji <U+0001F6A9><U+0001F6A9>
## 174349 Dr David Frawley
## 174350 Dr David Frawley
## 174351 Dr David Frawley
## retweet_followers_count retweet_friends_count retweet_statuses_count
## 174346 307220 153 23992
## 174347 307220 153 23992
## 174348 1106 150 14186
## 174349 307220 153 23992
## 174350 307220 153 23992
## 174351 307220 153 23992
## retweet_location
## 174346 Santa Fe, NM USA
## 174347 Santa Fe, NM USA
## 174348
## 174349 Santa Fe, NM USA
## 174350 Santa Fe, NM USA
## 174351 Santa Fe, NM USA
##
retweet_description
## 174346 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 174347 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 174348

## 174349 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda

and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 174350 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 174351 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## retweet_verified place_url place_name place_full_name place_type country
## 174346 TRUE <NA> <NA> <NA> <NA> <NA>
## 174347 TRUE <NA> <NA> <NA> <NA> <NA>
## 174348 FALSE <NA> <NA> <NA> <NA> <NA>
## 174349 TRUE <NA> <NA> <NA> <NA> <NA>
## 174350 TRUE <NA> <NA> <NA> <NA> <NA>
## 174351 TRUE <NA> <NA> <NA> <NA> <NA>
## country_code geo_coords coords_coords bbox_coords
## 174346 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 174347 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 174348 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 174349 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 174350 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 174351 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## status_url
## 174346 https://twitter.com/Chintan64138110/status/1216941161698349057
## 174347 https://twitter.com/Gajendr16824337/status/1216941181264777217
## 174348 https://twitter.com/onenonlymahakal/status/1216941189032640512
## 174349 https://twitter.com/shri0944/status/1216941193944190976
## 174350 https://twitter.com/pruthwirajdeo/status/1216941197249282050
## 174351 https://twitter.com/Aryaman111/status/1216941195357753345
## name location
## 174346 Chintan Kumar
## 174347 Gajendra Singh Shekhawat
## 174348 Nirmal Aj fan <U+0001F1EE><U+0001F1F3> Indore, India
## 174349 Shri (<U+0936><U+094D><U+0930><U+0940>) Bengaluru, India
## 174350 pruthwiraj
## 174351 Vivek
##
description
## 174346
Rashtravaadi | name changed for security reason |
## 174347
Friendly
## 174348
movie n cricket maniac,shiv bhakt n believer
## 174349 Proud Hindian (Hindu+Indian) \n<U+0C9C><U+0CC8>
<U+0CB6><U+0CCD><U+0CB0><U+0CC0> <U+0CB0><U+0CBE><U+0CAE><U+0CCD>
<U+0001F6A9>\n<U+0CAD><U+0CBE><U+0CB0><U+0CA4><U+0CCD> <U+0CAE><U+0CBE><U+0CA4><U+0CBE>
<U+0C95><U+0CBF> <U+0C9C><U+0CC8>
<U+0001F1EE><U+0001F1F3>\n<U+0CB5><U+0C82><U+0CA6><U+0CC7>
<U+0CAE><U+0CBE><U+0CA4><U+0CB0><U+0C82> <U+0001F1EE><U+0001F1F3>
## 174350

## 174351

## url protected followers_count friends_count listed_count statuses_count


## 174346 <NA> FALSE 6 77 0 610
## 174347 <NA> FALSE 13 258 0 63
## 174348 <NA> FALSE 647 2073 9 83860
## 174349 <NA> FALSE 49 334 0 1147
## 174350 <NA> FALSE 39 208 1 801
## 174351 <NA> FALSE 285 1326 0 3480
## favourites_count account_created_at verified profile_url

## 174346 659 2020-01-08 13:06:03 FALSE <NA>
## 174347 2785 2019-08-07 04:35:56 FALSE <NA>
## 174348 15036 2010-04-30 15:30:39 FALSE <NA>
## 174349 1824 2019-08-19 15:10:40 FALSE <NA>
## 174350 1378 2013-08-14 06:24:51 FALSE <NA>
## 174351 16825 2016-03-04 13:07:19 FALSE <NA>
## profile_expanded_url account_lang
## 174346 <NA> NA
## 174347 <NA> NA
## 174348 <NA> NA
## 174349 <NA> NA
## 174350 <NA> NA
## 174351 <NA> NA
## profile_banner_url
## 174346 <NA>
## 174347 <NA>
## 174348 https://pbs.twimg.com/profile_banners/138781014/1495135789
## 174349 https://pbs.twimg.com/profile_banners/1163468236496617474/1578248233
## 174350 <NA>
## 174351 <NA>
## profile_background_url
## 174346 <NA>
## 174347 <NA>
## 174348 http://abs.twimg.com/images/themes/theme1/bg.png
## 174349 <NA>
## 174350 http://abs.twimg.com/images/themes/theme1/bg.png
## 174351 <NA>
##
profile_image_url
## 174346
http://pbs.twimg.com/profile_images/1214897021569470464/PsmQvyxI_normal.jpg
## 174347
http://pbs.twimg.com/profile_images/1158960946838044673/8WZueqx5_normal.jpg
## 174348
http://pbs.twimg.com/profile_images/688317640897638403/ELaY-ZEX_normal.jpg
## 174349
http://pbs.twimg.com/profile_images/1163468598603444224/Ia-OmyqY_normal.jpg
## 174350
http://pbs.twimg.com/profile_images/378800000293680962/f80a13a608555e74bc6c43f883e9eb03_n
ormal.jpeg
## 174351
http://pbs.twimg.com/profile_images/1207229736100884480/s4fAGDXh_normal.jpg

Creating the variables for the pre- and post-release dates of the movie
# Edit the Release date of the movie in MM/DD/YYYY format

release_date <- as.Date("01/10/2020", format = "%m/%d/%Y")


release_date
## [1] "2020-01-10"
pre_last_week_date <- release_date - 7
pre_last_week_date
## [1] "2020-01-03"
post_first_week_date <- release_date + 7
post_first_week_date
## [1] "2020-01-17"

MOVIE PRE RELEASE ANALYSIS

Filter the pre-release data from the dataset

pre_tweets_data <- movie_tweets_data %>%
  filter(created_at >= pre_last_week_date & created_at < release_date)
head(pre_tweets_data)
## X.1 X user_id status_id created_at screen_name
## 1 54102 54102 1.054712e+18 1.212890e+18 2020-01-03 00:14:51 FanOfAjayDevgn1
## 2 54124 54124 1.054712e+18 1.212890e+18 2020-01-03 00:16:23 FanOfAjayDevgn1
## 3 63424 63424 1.184134e+18 1.212891e+18 2020-01-03 00:16:52 theeejay_muc
## 4 60620 60620 1.009151e+08 1.212892e+18 2020-01-03 00:21:51 Tanhaji_25Dec19
## 5 57625 57625 1.122538e+18 1.212897e+18 2020-01-03 00:40:44 Gopinat38606021
## 6 55184 55184 9.445703e+17 1.212900e+18 2020-01-03 00:56:17 AdiansNepal
##
text
## 1 With just days to the release
of #TanhajiTheUnsungWarriror, the makers are making the most of the time to promote the
movie. @ajaydevgn @TanhajiFilm @omraut @SharadK7 @itsKajolD #SaifAliKhan #Tanhaji
\nhttps://t.co/ewbaCNXKq2
## 2
8 days to go #Tanhaji https://t.co/qSsZ0HMA7R
## 3 Pan-Indian movie #Darbar overtakes Telugu biggie
#SarileruNekkevvaru and Hindi biggie #Tanhaji to become the most awaited movie! Let's hit
100K interests before the end of this weekend, folks! 200K before release!
<U+0001F60E><U+0001F918> https://t.co/1djPtbZBo5
## 4 #TANHAJI rated 15 by British censor board #BBFC\nRunning time 130m 30s\n#BBFCInsight
strong violence, bloody images\n#TanhajiTheUnsungWarrior a historical action drama set in
17th century in which a Maratha warrior embarks on a mission to recapture a hill fortress
taken by Mughal
## 5 Pan-Indian movie #Darbar overtakes Telugu biggie
#SarileruNekkevvaru and Hindi biggie #Tanhaji to become the most awaited movie! Let's hit
100K interests before the end of this weekend, folks! 200K before release!
<U+0001F60E><U+0001F918> https://t.co/1djPtbZBo5
## 6
Tanhaji in hindi belts and Darbar in South to have huge box office openings as per
interest shown in BMS both have 40.4K and more. Chhapak to start on a dull note
everything depends on WOM.#Tanhaji
## source display_text_width reply_to_status_id reply_to_user_id
## 1 Twitter for Android 139 NA NA
## 2 Twitter for Android 61 NA NA
## 3 Twitter for Android 140 NA NA
## 4 Twitter Web App 140 NA NA
## 5 Twitter for Android 140 NA NA
## 6 Twitter for Android 140 NA NA
## reply_to_screen_name is_quote is_retweet favorite_count retweet_count
## 1 <NA> FALSE TRUE 0 37
## 2 <NA> FALSE TRUE 0 41
## 3 <NA> FALSE TRUE 0 172
## 4 <NA> FALSE TRUE 0 5
## 5 <NA> FALSE TRUE 0 172
## 6 <NA> FALSE TRUE 0 2
## quote_count reply_count hashtags symbols
## 1 NA NA TanhajiTheUnsungWarriror NA
## 2 NA NA Tanhaji NA
## 3 NA NA c("Darbar", "SarileruNekkevvaru", "Tanhaji") NA
## 4 NA NA c("TANHAJI", "BBFC", "BBFCInsight") NA
## 5 NA NA c("Darbar", "SarileruNekkevvaru", "Tanhaji") NA
## 6 NA NA <NA> NA
## urls_url urls_t.co urls_expanded_url
## 1 <NA> <NA> <NA>

## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 <NA> <NA> <NA>
## 5 <NA> <NA> <NA>
## 6 <NA> <NA> <NA>
## media_url media_t.co
## 1 <NA> <NA>
## 2 http://pbs.twimg.com/media/ENSf8oHUUAI7mcr.jpg https://t.co/qSsZ0HMA7R
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> <NA>
## 6 <NA> <NA>
## media_expanded_url media_type
## 1 <NA> <NA>
## 2 https://twitter.com/Meena_Iyer/status/1212770073535864832/photo/1 photo
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> <NA>
## 6 <NA> <NA>
## ext_media_url ext_media_t.co
## 1 <NA> <NA>
## 2 http://pbs.twimg.com/media/ENSf8oHUUAI7mcr.jpg https://t.co/qSsZ0HMA7R
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> <NA>
## 6 <NA> <NA>
## ext_media_expanded_url
## 1 <NA>
## 2 https://twitter.com/Meena_Iyer/status/1212770073535864832/photo/1
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## ext_media_type mentions_user_id mentions_screen_name lang quoted_status_id
## 1 NA 2390513293 PuneTimesOnline en NA
## 2 NA 143098087 Meena_Iyer en NA
## 3 NA 497770148 RajiniFC en NA
## 4 NA 139639456 BreakingViews4u en NA
## 5 NA 497770148 RajiniFC en NA
## 6 NA 565560313 PradeepBastola en NA
## quoted_text quoted_created_at quoted_source quoted_favorite_count
## 1 <NA> <NA> <NA> NA
## 2 <NA> <NA> <NA> NA
## 3 <NA> <NA> <NA> NA
## 4 <NA> <NA> <NA> NA
## 5 <NA> <NA> <NA> NA
## 6 <NA> <NA> <NA> NA
## quoted_retweet_count quoted_user_id quoted_screen_name quoted_name
## 1 NA NA <NA> <NA>
## 2 NA NA <NA> <NA>
## 3 NA NA <NA> <NA>
## 4 NA NA <NA> <NA>
## 5 NA NA <NA> <NA>
## 6 NA NA <NA> <NA>
## quoted_followers_count quoted_friends_count quoted_statuses_count
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA

## 6 NA NA NA
## quoted_location quoted_description quoted_verified retweet_status_id
## 1 <NA> <NA> NA 1.212779e+18
## 2 <NA> <NA> NA 1.212770e+18
## 3 <NA> <NA> NA 1.212737e+18
## 4 <NA> <NA> NA 1.212821e+18
## 5 <NA> <NA> NA 1.212737e+18
## 6 <NA> <NA> NA 1.212581e+18
##
retweet_text
## 1 With just days to the release
of #TanhajiTheUnsungWarriror, the makers are making the most of the time to promote the
movie. @ajaydevgn @TanhajiFilm @omraut @SharadK7 @itsKajolD #SaifAliKhan #Tanhaji
\nhttps://t.co/ewbaCNXKq2
## 2
8 days to go #Tanhaji https://t.co/qSsZ0HMA7R
## 3 Pan-Indian movie #Darbar overtakes Telugu biggie
#SarileruNekkevvaru and Hindi biggie #Tanhaji to become the most awaited movie! Let's hit
100K interests before the end of this weekend, folks! 200K before release!
<U+0001F60E><U+0001F918> https://t.co/1djPtbZBo5
## 4 #TANHAJI rated 15 by British censor board #BBFC\nRunning time 130m 30s\n#BBFCInsight
strong violence, bloody images\n#TanhajiTheUnsungWarrior a historical action drama set in
17th century in which a Maratha warrior embarks on a mission to recapture a hill fortress
taken by Mughal
## 5 Pan-Indian movie #Darbar overtakes Telugu biggie
#SarileruNekkevvaru and Hindi biggie #Tanhaji to become the most awaited movie! Let's hit
100K interests before the end of this weekend, folks! 200K before release!
<U+0001F60E><U+0001F918> https://t.co/1djPtbZBo5
## 6
Tanhaji in hindi belts and Darbar in South to have huge box office openings as per
interest shown in BMS both have 40.4K and more. Chhapak to start on a dull note
everything depends on WOM.#Tanhaji
## retweet_created_at retweet_source retweet_favorite_count
## 1 2020-01-02 16:52:00 TweetDeck 137
## 2 2020-01-02 16:18:12 Twitter for iPhone 205
## 3 2020-01-02 14:06:41 Twitter Web App 410
## 4 2020-01-02 19:39:22 Twitter Web App 18
## 5 2020-01-02 14:06:41 Twitter Web App 410
## 6 2020-01-02 03:48:21 Twitter for Android 3
## retweet_retweet_count retweet_user_id retweet_screen_name
## 1 37 2390513293 PuneTimesOnline
## 2 41 143098087 Meena_Iyer
## 3 172 497770148 RajiniFC
## 4 5 139639456 BreakingViews4u
## 5 172 497770148 RajiniFC
## 6 2 565560313 PradeepBastola
## retweet_name retweet_followers_count retweet_friends_count
## 1 Pune Times 67861 1261
## 2 Meena Iyer 4414 103
## 3 Rajinikanth Fans <U+0001F918> 56964 275
## 4 Breaking Movies 23424 1709
## 5 Rajinikanth Fans <U+0001F918> 56964 275
## 6 Pradeep Bastola 143 51
## retweet_statuses_count retweet_location
## 1 16903 Pune, India
## 2 2243 India
## 3 29618
## 4 84747 FB.com/BreakMovies
## 5 29618
## 6 1458 Nepal

##
retweet_description
## 1 Official handle of Pune Times. Follow for news about the
city and updates from Bollywood and the Marathi entertainment industry
## 2 Influencer, CEO, Ajay Devgn FFilms, Ex-EDITOR Bombay Times and DNA
After Hours, Author Khullam Khulla. Retweets are not endorsements.
## 3 Rajini FC is a tribute to Superstar Rajinikanth handled by Team Rajinists. Updates
on Thalaivar Rajnikanth's next films and more. #Darbar #Thalaivar168
## 4
!!!!! Engine out, completely !!!!!
## 5 Rajini FC is a tribute to Superstar Rajinikanth handled by Team Rajinists. Updates
on Thalaivar Rajnikanth's next films and more. #Darbar #Thalaivar168
## 6 Tweets are personal, RTs are not
endorsement. I love my country, sports, movies and Medical Science.
## retweet_verified place_url place_name place_full_name place_type country
## 1 TRUE <NA> <NA> <NA> <NA> <NA>
## 2 TRUE <NA> <NA> <NA> <NA> <NA>
## 3 FALSE <NA> <NA> <NA> <NA> <NA>
## 4 FALSE <NA> <NA> <NA> <NA> <NA>
## 5 FALSE <NA> <NA> <NA> <NA> <NA>
## 6 FALSE <NA> <NA> <NA> <NA> <NA>
## country_code geo_coords coords_coords bbox_coords
## 1 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 2 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 3 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 4 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 5 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 6 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## status_url
## 1 https://twitter.com/FanOfAjayDevgn1/status/1212890025676816384
## 2 https://twitter.com/FanOfAjayDevgn1/status/1212890414467837953
## 3 https://twitter.com/theeejay_muc/status/1212890536119545858
## 4 https://twitter.com/Tanhaji_25Dec19/status/1212891789150961664
## 5 https://twitter.com/Gopinat38606021/status/1212896540500512768
## 6 https://twitter.com/AdiansNepal/status/1212900453937111040
## name location
## 1 <U+0001F31F>Fan Of Ajay Devgn<U+0001F31F>
## 2 <U+0001F31F>Fan Of Ajay Devgn<U+0001F31F>
## 3 Theeejay Germany
## 4 Sagar New Delhi, India
## 5 Gopinath
## 6 ADF NEPAL Kathmandu
##
description
## 1
TANHAJI \ni love Ajay sir
## 2
TANHAJI \ni love Ajay sir
## 3
Hi, here's is Theeejay! Main Twitter account: @theeejay
## 4

## 5

## 6 @ajaydevgn fan club Nepal. Die hard fan of the King of intensity & versatility, Two
Time national award winner & the real action hero. undisputed king of clash<U+0001F4AA>
## url protected followers_count friends_count listed_count statuses_count
## 1 <NA> FALSE 200 73 0 14911
## 2 <NA> FALSE 200 73 0 14911

## 3 <NA> FALSE 79 52 0 599
## 4 <NA> FALSE 1477 1091 11 26284
## 5 <NA> FALSE 354 713 2 30727
## 6 <NA> FALSE 101 265 0 3182
## favourites_count account_created_at verified profile_url
## 1 28093 2018-10-23 12:31:40 FALSE <NA>
## 2 28093 2018-10-23 12:31:40 FALSE <NA>
## 3 490 2019-10-15 15:47:25 FALSE <NA>
## 4 31826 2010-01-01 05:57:11 FALSE <NA>
## 5 38991 2019-04-28 16:27:36 FALSE <NA>
## 6 1282 2017-12-23 14:08:01 FALSE <NA>
## profile_expanded_url account_lang
## 1 <NA> NA
## 2 <NA> NA
## 3 <NA> NA
## 4 <NA> NA
## 5 <NA> NA
## 6 <NA> NA
## profile_banner_url
## 1 https://pbs.twimg.com/profile_banners/1054711956664434688/1555838130
## 2 https://pbs.twimg.com/profile_banners/1054711956664434688/1555838130
## 3 https://pbs.twimg.com/profile_banners/1184133591908921345/1576187331
## 4 https://pbs.twimg.com/profile_banners/100915145/1571653346
## 5 <NA>
## 6 https://pbs.twimg.com/profile_banners/944570289953824768/1514039933
## profile_background_url
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 http://abs.twimg.com/images/themes/theme1/bg.png
## 5 <NA>
## 6 http://abs.twimg.com/images/themes/theme1/bg.png
## profile_image_url
## 1 http://pbs.twimg.com/profile_images/1194577883181699073/S4UiPutd_normal.jpg
## 2 http://pbs.twimg.com/profile_images/1194577883181699073/S4UiPutd_normal.jpg
## 3 http://pbs.twimg.com/profile_images/1187238869311381504/AeBQsav4_normal.jpg
## 4 http://pbs.twimg.com/profile_images/1164493859386085376/cgxrrBl5_normal.png
## 5 http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png
## 6 http://pbs.twimg.com/profile_images/1212555422906769411/uw-yV2xx_normal.jpg
tail(pre_tweets_data)
## X.1 X user_id status_id created_at
## 67488 173184 106744 7.084906e+17 1.215422e+18 2020-01-09 23:56:04
## 67489 173181 106741 1.210608e+18 1.215422e+18 2020-01-09 23:56:21
## 67490 173168 106728 7.936731e+17 1.215422e+18 2020-01-09 23:56:58
## 67491 173167 106727 1.938105e+08 1.215422e+18 2020-01-09 23:57:10
## 67492 173166 106726 8.528880e+17 1.215423e+18 2020-01-09 23:58:32
## 67493 173165 106725 2.986419e+09 1.215423e+18 2020-01-09 23:59:27
## screen_name
## 67488 vk9378
## 67489 DeepakK98376858
## 67490 dev4Ind
## 67491 Ornawalla
## 67492 KrishnamitraHKJ
## 67493 vishalmellark
##
text
## 67488 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...

Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 67489
"Bharpoor Entertainer, Lajawab Film #Tanhaji".- Journalist #TanhajiReview\n\nBest Of
Luck &amp; Congratulations In Advance !!!\n@ajaydevgn and Fans !!!
#TanhajiTheUnsungWarrior https://t.co/0oV0N3TDpn
## 67490 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 67491 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 67492
Me:- 2morrow my a/c will be @verified &amp; we both will go to see #Tanhaji also
#Boycott_Chhapaak\n\nTwitter :-
<U+0001F447><U+0001F447><U+0001F937><U+0001F3FB><U+200D><U+2642><U+FE0F><U+0001F648><U+00
01F602> https://t.co/eQm6hUrwYm
## 67493
#Thalaiva the day 1 business of #Darbar &gt; #Tanhaji + #Chapaak more than double it is
actually if the worldwide fig <U+0001F60D> one sun .. one moon .. one star ..
superstar ..
## source display_text_width reply_to_status_id
## 67488 Twitter for Android 140 NA
## 67489 Twitter for Android 143 NA
## 67490 Twitter for iPhone 140 NA
## 67491 Twitter for iPhone 140 NA
## 67492 Twitter for Android 133 NA
## 67493 Twitter for iPhone 142 NA
## reply_to_user_id reply_to_screen_name is_quote is_retweet favorite_count
## 67488 NA <NA> FALSE TRUE 0
## 67489 NA <NA> FALSE TRUE 0
## 67490 NA <NA> FALSE TRUE 0
## 67491 NA <NA> FALSE TRUE 0
## 67492 NA <NA> FALSE TRUE 0
## 67493 NA <NA> FALSE TRUE 0
## retweet_count quote_count reply_count
## 67488 10287 NA NA
## 67489 144 NA NA
## 67490 10287 NA NA
## 67491 10287 NA NA
## 67492 38 NA NA
## 67493 2 NA NA
## hashtags symbols urls_url urls_t.co
## 67488 c("OneWordReview", "Tanhaji", "Tanhaji") NA <NA> <NA>
## 67489 c("Tanhaji", "TanhajiReview") NA <NA> <NA>
## 67490 c("OneWordReview", "Tanhaji", "Tanhaji") NA <NA> <NA>
## 67491 c("OneWordReview", "Tanhaji", "Tanhaji") NA <NA> <NA>
## 67492 c("Tanhaji", "Boycott_Chhapaak") NA <NA> <NA>
## 67493 c("Thalaiva", "Darbar", "Tanhaji", "Chapaak") NA <NA> <NA>
## urls_expanded_url media_url media_t.co media_expanded_url media_type
## 67488 <NA> <NA> <NA> <NA> <NA>
## 67489 <NA> <NA> <NA> <NA> <NA>
## 67490 <NA> <NA> <NA> <NA> <NA>
## 67491 <NA> <NA> <NA> <NA> <NA>
## 67492 <NA> <NA> <NA> <NA> <NA>
## 67493 <NA> <NA> <NA> <NA> <NA>

## ext_media_url ext_media_t.co ext_media_expanded_url ext_media_type
## 67488 <NA> <NA> <NA> NA
## 67489 <NA> <NA> <NA> NA
## 67490 <NA> <NA> <NA> NA
## 67491 <NA> <NA> <NA> NA
## 67492 <NA> <NA> <NA> NA
## 67493 <NA> <NA> <NA> NA
## mentions_user_id mentions_screen_name lang
## 67488 99642673 taran_adarsh en
## 67489 1610358128 iSKsCombat_ en
## 67490 99642673 taran_adarsh en
## 67491 99642673 taran_adarsh en
## 67492 c("1064198326990458880", "63796828") c("IamPurn", "verified") en
## 67493 1135988009453547520 Justano84979866 en
## quoted_status_id quoted_text quoted_created_at quoted_source
## 67488 NA <NA> <NA> <NA>
## 67489 NA <NA> <NA> <NA>
## 67490 NA <NA> <NA> <NA>
## 67491 NA <NA> <NA> <NA>
## 67492 NA <NA> <NA> <NA>
## 67493 NA <NA> <NA> <NA>
## quoted_favorite_count quoted_retweet_count quoted_user_id
## 67488 NA NA NA
## 67489 NA NA NA
## 67490 NA NA NA
## 67491 NA NA NA
## 67492 NA NA NA
## 67493 NA NA NA
## quoted_screen_name quoted_name quoted_followers_count
## 67488 <NA> <NA> NA
## 67489 <NA> <NA> NA
## 67490 <NA> <NA> NA
## 67491 <NA> <NA> NA
## 67492 <NA> <NA> NA
## 67493 <NA> <NA> NA
## quoted_friends_count quoted_statuses_count quoted_location
## 67488 NA NA <NA>
## 67489 NA NA <NA>
## 67490 NA NA <NA>
## 67491 NA NA <NA>
## 67492 NA NA <NA>
## 67493 NA NA <NA>
## quoted_description quoted_verified retweet_status_id
## 67488 <NA> NA 1.215295e+18
## 67489 <NA> NA 1.215298e+18
## 67490 <NA> NA 1.215295e+18
## 67491 <NA> NA 1.215295e+18
## 67492 <NA> NA 1.215317e+18
## 67493 <NA> NA 1.215396e+18
##
retweet_text
## 67488 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 67489
"Bharpoor Entertainer, Lajawab Film #Tanhaji".- Journalist #TanhajiReview\n\nBest Of
Luck &amp; Congratulations In Advance !!!\n@ajaydevgn and Fans !!!
#TanhajiTheUnsungWarrior https://t.co/0oV0N3TDpn

## 67490 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 67491 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 67492
Me:- 2morrow my a/c will be @verified &amp; we both will go to see #Tanhaji also
#Boycott_Chhapaak\n\nTwitter :-
<U+0001F447><U+0001F447><U+0001F937><U+0001F3FB><U+200D><U+2642><U+FE0F><U+0001F648><U+00
01F602> https://t.co/eQm6hUrwYm
## 67493
#Thalaiva the day 1 business of #Darbar &gt; #Tanhaji + #Chapaak more than double it is
actually if the worldwide fig <U+0001F60D> one sun .. one moon .. one star ..
superstar ..
## retweet_created_at retweet_source retweet_favorite_count
## 67488 2020-01-09 15:31:45 Twitter for iPad 49234
## 67489 2020-01-09 15:42:11 Twitter Web App 312
## 67490 2020-01-09 15:31:45 Twitter for iPad 49234
## 67491 2020-01-09 15:31:45 Twitter for iPad 49234
## 67492 2020-01-09 16:58:50 Twitter for Android 10
## 67493 2020-01-09 22:12:14 Twitter for iPhone 7
## retweet_retweet_count retweet_user_id retweet_screen_name
## 67488 10287 9.964267e+07 taran_adarsh
## 67489 144 1.610358e+09 iSKsCombat_
## 67490 10287 9.964267e+07 taran_adarsh
## 67491 10287 9.964267e+07 taran_adarsh
## 67492 38 1.064198e+18 IamPurn
## 67493 2 1.135988e+18 Justano84979866
##
retweet_name
## 67488
taran adarsh
## 67489
Sardar Singh
## 67490
taran adarsh
## 67491
taran adarsh
## 67492
Nipun<U+0001F1EE><U+0001F1F3><U+0001F441><U+FE0F><U+0001F443><U+0001F441><U+FE0F><U+0001F
6A9>
## 67493 <U+0930><U+093E><U+0927><U+0947> <U+092E><U+094B><U+0939><U+0928>
<U+0915><U+0947> <U+092B><U+093C><U+0948><U+0928>
## retweet_followers_count retweet_friends_count retweet_statuses_count
## 67488 3741490 168 34910
## 67489 1858 1342 5520
## 67490 3741490 168 34910
## 67491 3741490 168 34910
## 67492 8873 6668 17149
## 67493 3 2 265
##
retweet_location
## 67488
Mumbai, India
## 67489

Punjab, India
## 67490
Mumbai, India
## 67491
Mumbai, India
## 67492 <U+0938><U+093E><U+0930><U+0947> <U+091C><U+0939><U+0949> <U+0938><U+0947>
<U+0905><U+091A><U+094D><U+091B><U+093E><U+0001F449><U+092D><U+093E><U+0930><U+0924>
<U+092A><U+094D><U+092F><U+093E><U+0930><U+093E>
## 67493

##
retweet_description
## 67488
Movie critic | Biz analyst | Influencer | Instagram: https://t.co/4hvdbhvfkX
## 67489
If you even dream of beating me, you better wake up and apologise!!! .\nOnly God Of
Bollywood @BeingSalmanKhan Matters And Rules!!! \n#SalmanKhan Fan
## 67490
Movie critic | Biz analyst | Influencer | Instagram: https://t.co/4hvdbhvfkX
## 67491
Movie critic | Biz analyst | Influencer | Instagram: https://t.co/4hvdbhvfkX
## 67492
#<U+0932><U+0947><U+0916><U+0915><U+270D><U+FE0F>#<U+0930><U+093E><U+0937><U+094D><U+091F
><U+094D><U+0930><U+0935><U+093E><U+0926><U+0940>,#TaxPayer,TwitsIn<U+2764>
#LovesNature<U+0001F331>,flwdBy<U+0001F449>@SDPachauri
@bhavsarhardiik,@Real_anuj,@shwait_malik @caopmishra @zammit_marc<U+0001F60D> Sis-
@Saritasidh
## 67493
Here only for Salman Khan ..
## retweet_verified place_url place_name place_full_name place_type country
## 67488 TRUE <NA> <NA> <NA> <NA> <NA>
## 67489 FALSE <NA> <NA> <NA> <NA> <NA>
## 67490 TRUE <NA> <NA> <NA> <NA> <NA>
## 67491 TRUE <NA> <NA> <NA> <NA> <NA>
## 67492 FALSE <NA> <NA> <NA> <NA> <NA>
## 67493 FALSE <NA> <NA> <NA> <NA> <NA>
## country_code geo_coords coords_coords bbox_coords
## 67488 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 67489 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 67490 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 67491 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 67492 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 67493 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## status_url
## 67488 https://twitter.com/vk9378/status/1215422014736875521
## 67489 https://twitter.com/DeepakK98376858/status/1215422085570281472
## 67490 https://twitter.com/dev4Ind/status/1215422241980088320
## 67491 https://twitter.com/Ornawalla/status/1215422292391419904
## 67492 https://twitter.com/KrishnamitraHKJ/status/1215422634613071873
## 67493 https://twitter.com/vishalmellark/status/1215422867610861569
## name location
## 67488 Vikas Kumar <U+0001F1EE><U+0001F1F3> Agartala, India
## 67489 Deepak Kumar
## 67490 devp India
## 67491 P B India
## 67492 Krishnamitra Jauhar
## 67493 buttercakeluv
##
description
## 67488

## 67489
Student
## 67490
VandeMatram
## 67491
Liberals are bunch of Chutiyas. A Proud Hindu. Supporter of Truth. No Bullshit, Just
State the Facts. NaMo Fan. No Poverty. Respect <U+270A>.
## 67492 <U+0938><U+0943><U+0937><U+094D><U+091F><U+093F> <U+0939><U+0948>
<U+0939><U+0930><U+093F> <U+092E><U+0928><U+094D><U+0926><U+093F><U+0930>
<U+092E><U+0947><U+0930><U+093E>, <U+0927><U+094D><U+092F><U+093E><U+0928>
<U+0939><U+0948> <U+0938><U+091A><U+094D><U+091A><U+0940>
<U+092A><U+0942><U+091C><U+093E><U+0964>
<U+0938><U+092C><U+092E><U+0947><U+0902> <U+0915><U+0943><U+0937><U+094D><U+0923>
<U+0928><U+093F><U+0939><U+093E><U+0930><U+0942><U+0901>
'<U+091C><U+094C><U+0939><U+0930>',<U+092D><U+093E><U+0935> <U+0928>
<U+0930><U+093E><U+0916><U+0942><U+0901> <U+0926><U+0942><U+091C><U+093E><U+0964><U+0964>
\n<U+0001F449>My tweets are\nin<U+0001F449><U+2665><U+FE0F>Likes<U+0001F448>
## 67493
I apologize in advance. snapchat : vishalmellark | instagram : buttercakeluv
## url protected followers_count friends_count listed_count statuses_count
## 67488 <NA> FALSE 378 1358 10 89474
## 67489 <NA> FALSE 18 472 0 189
## 67490 <NA> FALSE 101 234 3 11762
## 67491 <NA> FALSE 37 105 0 13768
## 67492 <NA> FALSE 11013 513 23 214923
## 67493 <NA> FALSE 117 159 0 24693
## favourites_count account_created_at verified profile_url
## 67488 137120 2016-03-12 03:11:37 FALSE <NA>
## 67489 1015 2019-12-27 17:07:01 FALSE <NA>
## 67490 18285 2016-11-02 04:36:34 FALSE <NA>
## 67491 12866 2010-09-22 18:38:41 FALSE <NA>
## 67492 5150 2017-04-14 14:15:19 FALSE <NA>
## 67493 33882 2015-01-17 03:36:10 FALSE <NA>
## profile_expanded_url account_lang
## 67488 <NA> NA
## 67489 <NA> NA
## 67490 <NA> NA
## 67491 <NA> NA
## 67492 <NA> NA
## 67493 <NA> NA
## profile_banner_url
## 67488 https://pbs.twimg.com/profile_banners/708490600933367808/1457753503
## 67489 https://pbs.twimg.com/profile_banners/1210607948184997888/1577769005
## 67490 <NA>
## 67491 <NA>
## 67492 https://pbs.twimg.com/profile_banners/852887997448155137/1528385874
## 67493 https://pbs.twimg.com/profile_banners/2986419235/1576170519
## profile_background_url
## 67488 http://abs.twimg.com/images/themes/theme1/bg.png
## 67489 <NA>
## 67490 <NA>
## 67491 http://abs.twimg.com/images/themes/theme5/bg.gif
## 67492 <NA>
## 67493 http://abs.twimg.com/images/themes/theme1/bg.png
## profile_image_url
## 67488 http://pbs.twimg.com/profile_images/1100769543012605952/bpjVOp0C_normal.jpg
## 67489 http://pbs.twimg.com/profile_images/1211877019077640192/4c7pFtKC_normal.jpg
## 67490 http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png
## 67491 http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png

## 67492 http://pbs.twimg.com/profile_images/1004746850199527425/33n7gVGL_normal.jpg
## 67493 http://pbs.twimg.com/profile_images/1213529784237486080/RtX3Uiat_normal.jpg

Retrieve original tweets


# Remove retweets
movie_tweets_original <- pre_tweets_data[pre_tweets_data$is_retweet==FALSE, ]

# Remove replies
movie_tweets_original <- subset(movie_tweets_original,
is.na(movie_tweets_original$reply_to_status_id))

Find the most popular original tweets by favorite_count (i.e. the number of likes) and by retweet_count (i.e. the number of retweets)

# favorite_count
movie_tweets_original <- movie_tweets_original %>% arrange(-favorite_count)
movie_tweets_original[1,5]
## [1] "2020-01-08 08:31:48"
#retweet_count
movie_tweets_original <- movie_tweets_original %>% arrange(-retweet_count)
movie_tweets_original[1,5]
## [1] "2020-01-09 15:31:45"
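As an illustrative aside (not part of the original output above), the text of the most retweeted original tweet can also be inspected after the ordering applied in the previous step:

# Inspect the text of the top tweet after arranging by retweet_count
head(movie_tweets_original$text, 1)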

SHOW THE RATIO OF REPLIES/RETWEETS/ORIGINAL TWEETS

# dataset containing only the retweets and one containing only the replies.

# Keeping only the retweets


movie_retweets <- pre_tweets_data[pre_tweets_data$is_retweet==TRUE,]

# Keeping only the replies


movie_replies <- subset(pre_tweets_data, !is.na(pre_tweets_data$reply_to_status_id))

Create a separate data frame containing the number of original tweets, retweets, and replies

# Creating a data frame

original_count <- nrow(movie_tweets_original)


retweets_count <- nrow(movie_retweets)
replies_count <- nrow(movie_replies)

movie_data <- data.frame(


category=c("Original", "Retweets", "Replies"),
count=c(original_count, retweets_count, replies_count )
)

PLOTTING THE TYPES OF TWEETS (ORIGINAL, REPLIES, RETWEETS)

# Adding columns
movie_data$fraction = movie_data$count / sum(movie_data$count)
movie_data$percentage = movie_data$count / sum(movie_data$count) * 100
movie_data$ymax = cumsum(movie_data$fraction)
movie_data$ymin = c(0, head(movie_data$ymax, n=-1))

# Rounding the movie_data to two decimal points


movie_data <- round_df(movie_data, 2)

# Specify what the legend should say


Type_of_Tweet <- paste(movie_data$category, movie_data$percentage, "%")

ggplot(movie_data, aes(ymax=ymax, ymin=ymin,
xmax=4, xmin=3, fill=Type_of_Tweet)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(2, 4)) +
theme_void() +
theme(legend.position = "right")
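Note that round_df() used above is a small rounding utility defined earlier in this report. Purely for reference, a minimal sketch of such a helper (an assumed implementation, not the original definition) could look like this:

# Hypothetical helper: round every numeric column of a data frame to 'digits' places
round_df <- function(df, digits) {
  numeric_cols <- vapply(df, is.numeric, logical(1))   # identify numeric columns
  df[numeric_cols] <- round(df[numeric_cols], digits)  # round them in place
  df
}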

SHOW THE SOURCE DEVICES FROM WHICH THE TWEETS WERE PUBLISHED

tw_app <- pre_tweets_data %>%


select(source) %>%
group_by(source) %>%
summarize(count=n())
tw_app <- subset(tw_app, count > 1000)

device_data <- data.frame(


category=tw_app$source,
count=tw_app$count
)
device_data$fraction = device_data$count / sum(device_data$count)
device_data$percentage = device_data$count / sum(device_data$count) * 100
device_data$ymax = cumsum(device_data$fraction)
device_data$ymin = c(0, head(device_data$ymax, n=-1))
device_data <- round_df(device_data, 2)
Source <- paste(device_data$category, device_data$percentage, "%")
ggplot(device_data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=Source)) +
geom_rect() +
coord_polar(theta="y") + # Try to remove this to understand how the chart is built initially
xlim(c(2, 4)) +
theme_void() +
theme(legend.position = "right")

SHOW THE MOST FREQUENT WORDS FOUND IN THE TWEETS

#Cleaning the data


movie_tweets_original$text <- gsub("https\\S*", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("@\\S*", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("amp", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("[\r\n]", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("[[:punct:]]", "", movie_tweets_original$text)

# remove stop words from the text

tweets <- movie_tweets_original %>%


select(text) %>%
unnest_tokens(word, text)
tweets <- tweets %>%
anti_join(stop_words)
## Joining, by = "word"

PLOT THE MOST FREQUENT WORDS IN THE TWEETS

# gives a bar chart of the most frequent words found in the tweets
tweets %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(y = "Count",
x = "Unique words",
title = "Most frequent words found in the tweets of the movie",
subtitle = "Stop words removed from the list")
## Selecting by n

SHOW THE MOST FREQUENTLY USED HASHTAGS

movie_tweets_original$hashtags <- as.character(movie_tweets_original$hashtags)


movie_tweets_original$hashtags <- gsub("c\\(", "", movie_tweets_original$hashtags)
set.seed(1234)
wordcloud(movie_tweets_original$hashtags, min.freq=50, scale=c(2, 1), random.order=FALSE,
rot.per=0.35, colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

SHOW THE ACCOUNTS FROM WHICH MOST RETWEETS ORIGINATE

set.seed(1234)
wordcloud(pre_tweets_data$retweet_screen_name, min.freq=200, scale=c(2, .5),
random.order=FALSE, rot.per=0.25,
colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

SHOW THE LOCATIONS FROM WHICH MOST OF THE TWEETS ORIGINATE

set.seed(1234)

wordcloud(pre_tweets_data$location, min.freq=200, scale=c(3, 1), random.order=FALSE,


rot.per=0.25,
colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

#wordcloud(tweets_data$retweet_location, min.freq=500,colors=brewer.pal(8, "Dark2"))

SHOW THE LOCATIONS FROM WHICH MOST OF THE RETWEETS ORIGINATE

set.seed(1234)

wordcloud(pre_tweets_data$retweet_location, min.freq=200, scale=c(3, 1),


random.order=FALSE, rot.per=0.25,
colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

#wordcloud(tweets_data$retweet_location, min.freq=500,colors=brewer.pal(8, "Dark2"))

PERFORM A SENTIMENT ANALYSIS OF THE TWEETS ( “syuzhet” package )

# Converting tweets to ASCII to tackle strange characters


tweets <- iconv(tweets, from="UTF-8", to="ASCII", sub="")

# removing retweets, in case needed


tweets <-gsub("(RT|via)((?:\\b\\w*@\\w+)+)","",tweets)

# removing mentions, in case needed


tweets <-gsub("@\\w+","",tweets)

ew_sentiment<-get_nrc_sentiment((tweets))
sentimentscores<-data.frame(colSums(ew_sentiment[,]))
names(sentimentscores) <- "Score"
sentimentscores <- cbind("sentiment"=rownames(sentimentscores),sentimentscores)
rownames(sentimentscores) <- NULL
ggplot(data=sentimentscores,aes(x=sentiment,y=Score))+
geom_bar(aes(fill=sentiment),stat = "identity")+
theme(legend.position="none")+
xlab("Sentiments")+ylab("Scores")+
ggtitle("Total sentiment based on scores")+
theme_minimal()

CLEANING THE DATA

pre_tweets_data$text = gsub("&amp", "", pre_tweets_data$text)
pre_tweets_data$text = gsub("&amp", "", pre_tweets_data$text)
pre_tweets_data$text = gsub("rt|RT", "", pre_tweets_data$text) # remove Retweet
pre_tweets_data$text = iconv(pre_tweets_data$text, "latin1", "ASCII", sub="") # Remove emojis/dodgy unicode
pre_tweets_data$text = gsub("<(.*)>", "", pre_tweets_data$text) # Remove pesky Unicodes like <U+A>
pre_tweets_data$text = gsub("https(.*)*$", "", pre_tweets_data$text) # remove tweet URL
pre_tweets_data$text = gsub("www[[:alnum:][:punct:]]*", "", tolower(pre_tweets_data$text)) # remove www links and convert to lowercase
pre_tweets_data$text = gsub("<.*?>", "", pre_tweets_data$text) # remove html tags
pre_tweets_data$text = gsub("@\\w+", "", pre_tweets_data$text) # remove at(@)
pre_tweets_data$text = gsub("[[:punct:]]", "", pre_tweets_data$text) # remove punctuation
pre_tweets_data$text = gsub("\r?\n|\r", " ", pre_tweets_data$text) # remove \n
pre_tweets_data$text = gsub("[[:digit:]]", " ", pre_tweets_data$text) # remove numbers/Digits
pre_tweets_data$text = gsub("[ |\t]{2,}", " ", pre_tweets_data$text) # remove tabs
pre_tweets_data$text = gsub("^ ", "", pre_tweets_data$text) # remove blank spaces at the beginning
pre_tweets_data$text = gsub(" $", "", pre_tweets_data$text) # remove blank spaces at the end

head(pre_tweets_data$text)
## [1] "with just days to the release of tanhajitheunsungwarriror the makers are making
the most of the time to promote the movie saifalikhan tanhaji"

## [2] "days to go tanhaji"

## [3] "panindian movie darbar oveakes telugu biggie sarilerunekkevvaru and hindi biggie
tanhaji to become the most awaited movie lets hit k interests before the end of this
weekend folks k before release"

## [4] "tanhaji rated by british censor board bbfc running time m s bbfcinsight strong
violence bloody images tanhajitheunsungwarrior a historical action drama set in th
century in which a maratha warrior embarks on a mission to recapture a hill foress taken
by mughal"
## [5] "panindian movie darbar oveakes telugu biggie sarilerunekkevvaru and hindi biggie
tanhaji to become the most awaited movie lets hit k interests before the end of this
weekend folks k before release"

## [6] "tanhaji in hindi belts and darbar in south to have huge box office openings as
per interest shown in bms both have k and more chhapak to sta on a dull note everything
depends on womtanhaji"

Create a subset of the tweet text

set.seed(777) # Make process reproducible


sub_blogs = pre_tweets_data$text[sample(length(pre_tweets_data$text),
                                        length(pre_tweets_data$text) * 0.1)] # make subset

Creating a corpus and cleaning data

sub_blogs_Corpus <- VCorpus(VectorSource(sub_blogs)) # Make corpus

sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, stripWhitespace) # Remove unnecessary white spaces
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removePunctuation) # Remove punctuation
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removeNumbers) # Remove numbers
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, tolower) # Convert to lowercase
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, PlainTextDocument) # Plain text
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removeWords, stopwords("english")) # Remove English stop words

Tokenizing, calculating frequencies and making plots of n-grams

n_grams_plot <- function(n, data) {
options(mc.cores=1)

# Builds n-gram tokenizer


tk <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
# Create matrix
ngrams_matrix <- TermDocumentMatrix(data, control=list(tokenize=tk))
# make matrix for easy view
ngrams_matrix <- as.matrix(rollup(ngrams_matrix, 2, na.rm=TRUE, FUN=sum))
ngrams_matrix <- data.frame(word=rownames(ngrams_matrix), freq=ngrams_matrix[,1])
# find 20 most frequent n-grams in the matrix
ngrams_matrix <- ngrams_matrix[order(-ngrams_matrix$freq), ][1:20, ]
ngrams_matrix$word <- factor(ngrams_matrix$word, as.character(ngrams_matrix$word))

# plots
ggplot(ngrams_matrix, aes(x=word, y=freq)) +
geom_bar(stat="Identity", fill="pink", colour="black") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("n-grams") +
ylab("Frequency")
}

Plot of frequency distribution of 1-gram

n_grams_plot(n=1, data=sub_blogs_Corpus)

Plot of frequency distribution of 2-gram

n_grams_plot(n=2, data=sub_blogs_Corpus)

Plot of frequency distribution of 3-gram

n_grams_plot(n=3, data=sub_blogs_Corpus)

Plot of frequency distribution of 4-gram

n_grams_plot(n=4, data=sub_blogs_Corpus)

Create the corpus and derive the rating of the movie based on the scores given by the syuzhet package

sent.value <- get_sentiment(pre_tweets_data$text)

corpus_tw = Corpus(VectorSource(pre_tweets_data$text))

corpus_tw = tm_map(corpus_tw, tolower)


## Warning in tm_map.SimpleCorpus(corpus_tw, tolower): transformation drops
## documents
corpus_tw = tm_map(corpus_tw, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus_tw, removePunctuation): transformation
## drops documents
corpus_tw = tm_map(corpus_tw, removeWords, c(stopwords("english")))
## Warning in tm_map.SimpleCorpus(corpus_tw, removeWords, c(stopwords("english"))):
## transformation drops documents
corpus_tw = tm_map(corpus_tw, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus_tw, stemDocument): transformation drops
## documents
frequencies_tw = DocumentTermMatrix(corpus_tw)

sparse_tw = removeSparseTerms(frequencies_tw, 0.995)

sparse_tw.df = as.data.frame(as.matrix(sparse_tw))

colnames(sparse_tw.df) = make.names(colnames(sparse_tw.df))

Classify the tweets into 5 categories based on the scores provided by the get_sentiment function.

#category_sentiment <- ifelse(sent.value < 0, "Bad",
#                      ifelse(sent.value >= 0 & sent.value < 0.5, "Average",
#                      ifelse(sent.value >= 0.5 & sent.value < 1, "Good",
#                      ifelse(sent.value >= 1 & sent.value < 1.5, "Very Good", "Excellent"))))

#sparse_tw.df$Polarity = category_sentiment

#table(sparse_tw.df$Polarity)

#category_sentiment <- ifelse(sent.value < 0, "Bad",
#                      ifelse(sent.value == 0, "Ignore",
#                      ifelse(sent.value > 0 & sent.value < 0.5, "Average",
#                      ifelse(sent.value >= 0.5 & sent.value < 1, "Good",
#                      ifelse(sent.value >= 1 & sent.value < 1.5, "Very Good", "Excellent")))))

#sparse_tw.df$Polarity = category_sentiment

#table(sparse_tw.df$Polarity)
category_sentiment <- ifelse(sent.value < 0, 1, ifelse(sent.value == 0 , "Ignore",
ifelse(sent.value > 0 & sent.value < 0.5, 2,
ifelse(sent.value >=0.5 & sent.value < 1 ,3,
ifelse(sent.value >= 1 & sent.value < 1.5 ,4,5)))))

sparse_tw.df$Polarity = category_sentiment

table(sparse_tw.df$Polarity)
##
## 1 2 3 4 5 Ignore
## 8345 6124 14411 12434 15069 11110
sparse_tw_new.df <- filter(sparse_tw.df, Polarity != "Ignore")

table(sparse_tw_new.df$Polarity)
##
## 1 2 3 4 5
## 8345 6124 14411 12434 15069
ggplot(data=sparse_tw_new.df, aes(Polarity, fill=Polarity))+ geom_bar()
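As an optional sketch (not part of the original pipeline), an overall 1-5 rating for the movie can be derived from this polarity distribution by averaging the numeric categories; polarity_numeric and overall_rating are names introduced here only for illustration.

# Average the 1-5 polarity categories to get a single pre-release rating
polarity_numeric <- as.numeric(as.character(sparse_tw_new.df$Polarity))
overall_rating <- round(mean(polarity_numeric), 2)
overall_rating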

Plot the percentage share of each polarity category

ggplot(sparse_tw.df, aes(x= Polarity)) +


geom_bar(aes(y = ..prop.., fill = Polarity ) , stat="count") +
geom_text(aes( label = scales::percent(..prop..),
y= ..prop.. ), stat= "count", vjust = -.5)+
labs(y = "Percent") +
scale_y_continuous(labels = scales::percent)

BUILD CLASSIFICATION MODELS AND PREDICT FOR PRE RELEASE ANALYSIS

We will use different classification models and compare their accuracy and performance. Polarity
will be the dependent variable.
prop.table(table(sparse_tw_new.df$Polarity))
##
## 1 2 3 4 5
## 0.1480056 0.1086143 0.2555912 0.2205275 0.2672614

Extract a random sample of 5,000 observations from the DTM

#model_data <- sparse_tw_new.df %>% sample_frac(0.10)


model_data <- sample_n(sparse_tw_new.df, 5000)
dim(model_data)
## [1] 5000 423

Split the data into Train and Test

library(caTools)
##
## Attaching package: 'caTools'
## The following object is masked from 'package:RWeka':
##
## LogitBoost
set.seed(777)

model_data$Polarity <- as.factor(model_data$Polarity)

spl = sample.split(model_data$Polarity, SplitRatio = 0.7)

train_data = subset(model_data, spl == TRUE)


test_data = subset(model_data, spl == FALSE)

prop.table(table(train_data$Polarity))
##
## 1 2 3 4 5
## 0.1482857 0.1080000 0.2491429 0.2268571 0.2677143
prop.table(table(test_data$Polarity))
##
## 1 2 3 4 5
## 0.1480000 0.1080000 0.2493333 0.2273333 0.2673333

Build CART Model

# Load the Libraries


library(rpart)
library(rpart.plot)

movie_cart_model = rpart(Polarity ~ ., data=train_data, method="class")

#CART Diagram
prp(movie_cart_model, extra=2)

Predict and Evaluate the Performance of CART on train data

predict_cart_train_pre = predict(movie_cart_model, data=train_data, type="class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_cart <- table(train_data$Polarity, predict_cart_train_pre)
confusion_matrix_cart
## predict_cart_train_pre
## 1 2 3 4 5
## 1 293 0 217 1 8
## 2 0 81 292 1 4
## 3 2 0 847 3 20
## 4 6 0 182 583 23
## 5 0 0 166 52 719
# Overall accuracy
accuracy_cart_train_pre = sum(diag(confusion_matrix_cart))/sum(confusion_matrix_cart)
accuracy_cart_train_pre
## [1] 0.7208571

Predict and Evaluate the Performance of CART on test data

predict_cart_test_pre = predict(movie_cart_model, newdata=test_data, type="class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_cart <- table(test_data$Polarity, predict_cart_test_pre)
confusion_matrix_cart
## predict_cart_test_pre
## 1 2 3 4 5
## 1 133 0 88 0 1
## 2 0 35 122 0 5
## 3 0 0 363 1 10

## 4 6 0 77 250 8
## 5 0 0 73 31 297
# Overall accuracy
accuracy_cart_test_pre = sum(diag(confusion_matrix_cart))/sum(confusion_matrix_cart)
accuracy_cart_test_pre
## [1] 0.7186667

AUC-ROC Curve for CART on Train and Test dataset

#Train data - Plot ROC curve


roc_obj_cart_train_pre <- multiclass.roc(as.numeric(train_data$Polarity),
                                         as.numeric(predict_cart_train_pre), quiet=TRUE)
roc_obj_cart_train_pre
##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_cart_train_pre), quiet = TRUE)
##
## Data: as.numeric(predict_cart_train_pre) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.8387
#Test data - Plot ROC curve
roc_obj_cart_test_pre <- multiclass.roc(as.numeric(test_data$Polarity),
                                        as.numeric(predict_cart_test_pre), quiet=TRUE)
roc_obj_cart_test_pre
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_cart_test_pre), quiet = TRUE)
##
## Data: as.numeric(predict_cart_test_pre) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.8389

Comparison of all the performance measures of the CART Model on the Train and Test datasets

results_cart_train_pre = data.frame(accuracy_cart_train_pre,
as.numeric(roc_obj_cart_train_pre$auc))
names(results_cart_train_pre) = c("ACCURACY", "AUC-ROC" )

results_cart_test_pre =
data.frame(accuracy_cart_test_pre,as.numeric(roc_obj_cart_test_pre$auc) )
names(results_cart_test_pre) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_cart_train_pre, results_cart_test_pre)


row.names(df_fin) = c('CART Train Pre', 'CART Test Pre')
df_fin
## ACCURACY AUC-ROC
## CART Train Pre 0.7208571 0.8386559
## CART Test Pre 0.7186667 0.8389384

Build Random Forest Model

# Load Library
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
set.seed(777)

movie_rf_model = randomForest(Polarity ~ ., data=train_data,importance=TRUE)


movie_rf_model
##
## Call:
## randomForest(formula = Polarity ~ ., data = train_data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 20
##
## OOB estimate of error rate: 6.46%
## Confusion matrix:
## 1 2 3 4 5 class.error
## 1 458 14 44 2 1 0.11753372
## 2 8 335 29 1 5 0.11375661
## 3 12 5 849 5 1 0.02637615
## 4 7 6 32 741 8 0.06675063
## 5 4 9 29 4 891 0.04909285

Predict and Evaluate the Performance of Random Forest on train data

# Make predictions:
predict_rf_train_pre = predict(movie_rf_model, data=train_data,type="response")

# Evaluate the performance - Confusion matrix :


confusion_matrix_rf <- table(train_data$Polarity, predict_rf_train_pre)
confusion_matrix_rf
## predict_rf_train_pre
## 1 2 3 4 5
## 1 458 14 44 2 1
## 2 8 335 29 1 5
## 3 12 5 849 5 1
## 4 7 6 32 741 8
## 5 4 9 29 4 891
# Overall accuracy:
accuracy_rf_train_pre = sum(diag(confusion_matrix_rf))/sum(confusion_matrix_rf)
accuracy_rf_train_pre
## [1] 0.9354286

Predict and Evaluate the Performance of Random Forest on test data

# Make predictions:
predict_rf_test_pre = predict(movie_rf_model, newdata=test_data,type="response")

# Evaluate the performance - Confusion matrix :


confusion_matrix_rf <- table(test_data$Polarity, predict_rf_test_pre)
confusion_matrix_rf
## predict_rf_test_pre
## 1 2 3 4 5
## 1 203 5 14 0 0
## 2 6 139 12 1 4
## 3 6 2 365 0 1
## 4 4 1 17 318 1
## 5 2 0 21 4 374
# Overall accuracy:
accuracy_rf_test_pre = sum(diag(confusion_matrix_rf))/sum(confusion_matrix_rf)
accuracy_rf_test_pre
## [1] 0.9326667

Variable Importance of Random Forest

#Variable importance:
#varImpPlot(movie_rf_model,main='Variable Importance Plot: Movie ',type=2)
varImpPlot(movie_rf_model,sort=TRUE,type=NULL, class=NULL, scale=TRUE,cex=.8)
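For a tabular view to complement the plot above, the most important terms can also be listed directly. This is an optional sketch using the importance() accessor from randomForest; importance_df is a name introduced here only for illustration.

# List the ten terms with the highest MeanDecreaseGini
importance_df <- as.data.frame(importance(movie_rf_model))
head(importance_df[order(-importance_df$MeanDecreaseGini), ], 10)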

AUC-ROC Curve for Random Forest on Train and Test dataset

#Train data - Plot ROC curve


roc_obj_rf_train_pre <- multiclass.roc(as.numeric(train_data$Polarity),
                                       as.numeric(predict_rf_train_pre), quiet=TRUE)
roc_obj_rf_train_pre

##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_rf_train_pre), quiet = TRUE)
##
## Data: as.numeric(predict_rf_train_pre) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.9585
#Test data - Plot ROC curve
roc_obj_rf_test_pre <- multiclass.roc(as.numeric(test_data$Polarity),
                                      as.numeric(predict_cart_test_pre), quiet=TRUE)
roc_obj_rf_test_pre
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_cart_test_pre), quiet = TRUE)
##
## Data: as.numeric(predict_cart_test_pre) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.8389

Comparison of all the performance measures of the Random Forest Model on the Train and Test datasets

results_rf_train_pre = data.frame(accuracy_rf_train_pre,
as.numeric(roc_obj_rf_train_pre$auc))
names(results_rf_train_pre) = c("ACCURACY", "AUC-ROC" )

results_rf_test_pre = data.frame(accuracy_rf_test_pre,as.numeric(roc_obj_rf_test_pre$auc)
)
names(results_rf_test_pre) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_rf_train_pre, results_rf_test_pre)


row.names(df_fin) = c('Random Forest Train Pre', 'Random Forest Test Pre')
df_fin
## ACCURACY AUC-ROC
## Random Forest Train Pre 0.9354286 0.9584646
## Random Forest Test Pre 0.9326667 0.8389384

Build SVM Model

set.seed(123)
library(e1071)

movie_svm_model = svm(Polarity ~ . , data = train_data)


## Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
## Variable(s) 'survey' and 'lagaan' and 'tadka' and 'britishcensor' and 'azaadi'
## and 'raheng' and 'yaaar' constant. Cannot scale data.
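The warning above is only about scaling: a few term columns are constant in the training sample. As an optional sketch (an assumption, not part of the original run), such columns could be dropped before fitting; predictor_cols, is_constant, train_data_svm and movie_svm_model_noconst are illustrative names, and the model fitted above is the one used in the predictions that follow.

# Drop constant predictors so svm() can scale the remaining columns without warnings
predictor_cols <- setdiff(names(train_data), "Polarity")
is_constant <- vapply(train_data[predictor_cols],
                      function(x) length(unique(x)) <= 1, logical(1))
train_data_svm <- train_data[, c(predictor_cols[!is_constant], "Polarity")]
movie_svm_model_noconst <- svm(Polarity ~ ., data = train_data_svm)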

Predict and Evaluate the performance of SVM Model on train data

# Make predictions:
predict_svm_train_pre = predict(movie_svm_model, data=train_data, decision.values=TRUE)

# Evaluate the performance - Confusion matrix :


confusion_matrix_svm <- table(train_data$Polarity, predict_svm_train_pre)
confusion_matrix_svm

## predict_svm_train_pre
## 1 2 3 4 5
## 1 272 10 181 7 49
## 2 1 116 203 0 58
## 3 0 0 857 1 14
## 4 3 0 220 524 47
## 5 0 1 105 1 830
# Overall accuracy:
accuracy_svm_train_pre = sum(diag(confusion_matrix_svm))/sum(confusion_matrix_svm)
accuracy_svm_train_pre
## [1] 0.7425714

Predict and Evaluate the performance of SVM Model on test data

# Make predictions:
predict_svm_test_pre = predict(movie_svm_model, newdata=test_data,decision.values=TRUE)

# Evaluate the performance - Confusion matrix :


confusion_matrix_svm <- table(test_data$Polarity, predict_svm_test_pre)
confusion_matrix_svm
## predict_svm_test_pre
## 1 2 3 4 5
## 1 123 4 74 2 19
## 2 0 50 92 0 20
## 3 0 0 363 0 11
## 4 1 0 98 218 24
## 5 0 0 50 4 347
# Overall accuracy:
accuracy_svm_test_pre = sum(diag(confusion_matrix_svm))/sum(confusion_matrix_svm)
accuracy_svm_test_pre
## [1] 0.734

AUC-ROC Curve for SVM model on Train and Test dataset

library(pROC)

#Train data - Plot ROC curve


roc_obj_svm_train_pre <- multiclass.roc(as.numeric(train_data$Polarity),
                                        as.numeric(predict_svm_train_pre), quiet=TRUE)
roc_obj_svm_train_pre
##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_svm_train_pre), quiet = TRUE)
##
## Data: as.numeric(predict_svm_train_pre) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.8089
#Test data - Plot ROC curve
roc_obj_svm_test_pre <- multiclass.roc(as.numeric(test_data$Polarity),
                                       as.numeric(predict_svm_test_pre), quiet=TRUE)
roc_obj_svm_test_pre
##
## Call:

## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_svm_test_pre), quiet = TRUE)
##
## Data: as.numeric(predict_svm_test_pre) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.8154

Comparison of all the performance measures of the SVM Model on the Train and Test datasets

results_svm_train_pre = data.frame(accuracy_svm_train_pre,
as.numeric(roc_obj_svm_train_pre$auc))
names(results_svm_train_pre) = c("ACCURACY", "AUC-ROC" )

results_svm_test_pre =
data.frame(accuracy_svm_test_pre,as.numeric(roc_obj_svm_test_pre$auc) )
names(results_svm_test_pre) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_svm_train_pre, results_svm_test_pre)


row.names(df_fin) = c('SVM Train Pre', 'SVM Test Pre')
df_fin
## ACCURACY AUC-ROC
## SVM Train Pre 0.7425714 0.8089172
## SVM Test Pre 0.7340000 0.8153728

Build Naive Bayes Model

set.seed(777)

movie_nb_model = naiveBayes(Polarity ~ . , usekernel=T, data = train_data)

Predict and Evaluate the performance of NB Model on train data

# Make predictions:
predict_nb_train_pre = predict(movie_nb_model, train_data, type = "class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_nb <- table(train_data$Polarity, predict_nb_train_pre)
confusion_matrix_nb
## predict_nb_train_pre
## 1 2 3 4 5
## 1 370 72 12 65 0
## 2 28 290 6 54 0
## 3 77 136 199 460 0
## 4 52 77 8 657 0
## 5 181 245 89 247 175
# Overall accuracy:
accuracy_nb_train_pre = sum(diag(confusion_matrix_nb))/sum(confusion_matrix_nb)
accuracy_nb_train_pre
## [1] 0.4831429

Predict and Evaluate the performance of NB Model on test data

# Make predictions:
predict_nb_test_pre = predict(movie_nb_model,newdata = test_data, type = "class")

# Evaluate the performance - Confusion matrix :

confusion_matrix_nb <- table(test_data$Polarity, predict_nb_test_pre)
confusion_matrix_nb
## predict_nb_test_pre
## 1 2 3 4 5
## 1 155 36 5 26 0
## 2 10 126 3 23 0
## 3 43 77 92 162 0
## 4 20 39 7 275 0
## 5 79 109 29 109 75
# Overall accuracy:
accuracy_nb_test_pre = sum(diag(confusion_matrix_nb))/sum(confusion_matrix_nb)
accuracy_nb_test_pre
## [1] 0.482

AUC-ROC Curve for Naive Bayes model on Train and Test dataset

library(pROC)

#Train data - Plot ROC curve


roc_obj_nb_train_pre <- multiclass.roc(as.numeric(train_data$Polarity),
                                       as.numeric(predict_nb_train_pre), quiet=TRUE)
roc_obj_nb_train_pre
##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_nb_train_pre), quiet = TRUE)
##
## Data: as.numeric(predict_nb_train_pre) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.7285
#Test data - Plot ROC curve
roc_obj_nb_test_pre <-
multiclass.roc(as.numeric(test_data$Polarity),as.numeric(predict_nb_test_pre),quiet=TRUE)
roc_obj_nb_test_pre
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_nb_test_pre), quiet = TRUE)
##
## Data: as.numeric(predict_nb_test_pre) with 5 levels of as.numeric(test_data$Polarity):
1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.7203

Comparison of all the performance measures of the Naive Bayes Model on the Train and Test datasets

results_nb_train_pre = data.frame(accuracy_nb_train_pre,
as.numeric(roc_obj_nb_train_pre$auc))
names(results_nb_train_pre) = c("ACCURACY", "AUC-ROC" )

results_nb_test_pre = data.frame(accuracy_nb_test_pre,as.numeric(roc_obj_nb_test_pre$auc)
)
names(results_nb_test_pre) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_nb_train_pre, results_nb_test_pre)

row.names(df_fin) = c('Naive Bayes Train Pre', 'Naive Bayes Test Pre')
df_fin
## ACCURACY AUC-ROC
## Naive Bayes Train Pre 0.4831429 0.7284567
## Naive Bayes Test Pre 0.4820000 0.7202593

Comparing all the models - CART, Random Forest, SVM and Naive Bayes - on their performance
measures: Accuracy and AUC-ROC

df_fin = rbind(results_cart_train_pre, results_cart_test_pre,
               results_rf_train_pre, results_rf_test_pre,
               results_svm_train_pre, results_svm_test_pre,
               results_nb_train_pre, results_nb_test_pre)

row.names(df_fin) = c('CART Train Pre', 'CART Test Pre',
                      'Random Forest Train Pre', 'Random Forest Test Pre',
                      'SVM Train Pre', 'SVM Test Pre',
                      'Naive Bayes Train Pre', 'Naive Bayes Test Pre')

#round(df_fin,2)

#install.packages("kableExtra")
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
print("Model Performance Comparison Metrics ")
## [1] "Model Performance Comparison Metrics "
kable(round(df_fin,2)) %>%
kable_styling(c("striped","bordered"))

                           ACCURACY   AUC-ROC
CART Train Pre                 0.72      0.84
CART Test Pre                  0.72      0.84
Random Forest Train Pre        0.94      0.96
Random Forest Test Pre         0.93      0.84
SVM Train Pre                  0.74      0.81
SVM Test Pre                   0.73      0.82
Naive Bayes Train Pre          0.48      0.73
Naive Bayes Test Pre           0.48      0.72
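
Accuracy and AUC-ROC are the only measures computed in this comparison; if per-class sensitivity and specificity are also wanted, they can be read off an existing confusion matrix. A minimal sketch, assuming the caret package is installed (caret is not otherwise used in this report), illustrated with the Naive Bayes test predictions above:

library(caret)

# One row per polarity class, with Sensitivity and Specificity among the columns of byClass.
cm_nb <- confusionMatrix(data      = predict_nb_test_pre,
                         reference = factor(test_data$Polarity, levels = levels(predict_nb_test_pre)))
round(cm_nb$byClass[, c("Sensitivity", "Specificity")], 2)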

POST MOVIE RELEASE ANALYSIS

Read the tweets after the movie release

post_tweets_data <- movie_tweets_data %>%


filter(created_at >= release_date & created_at < post_first_week_date)
head(post_tweets_data)
## X.1 X user_id status_id created_at screen_name
## 1 158119 91679 1.039020e+18 1.215423e+18 2020-01-10 00:00:23 AJAY_sardar_
## 2 170582 104142 1.214066e+18 1.215423e+18 2020-01-10 00:00:53 Vastha421
## 3 164898 98458 9.079010e+17 1.215423e+18 2020-01-10 00:01:19 RAMUKUM01606330
## 4 172223 105783 8.842913e+07 1.215423e+18 2020-01-10 00:01:42 PARVATHINP
## 5 158922 92482 8.525627e+17 1.215423e+18 2020-01-10 00:01:55 Ankit_patel_AP
## 6 173164 106724 9.325149e+17 1.215423e+18 2020-01-10 00:01:55 NanoMIndia
##
text
## 1
Movie #Tanhaji is not holiday release so comparing its advance with holiday releases only
show ur jealous soul.
## 2
I can see a certain honesty in the early reviews of #Tanhaji which are coming out!! This
is so bloody rare in these times! And the kind of things people are saying,i am now so so
excited to catch it at the earliest! #TanhajiReview
## 3 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 4

#TanhajiTheUnsungWarrior Media Screening report is EPIC , my friend who is attending the
screening in mumbai saying its best film of @ajaydevgn career. #Tanhaji
## 5
WHAT A FILM YAAAR!<U+0001F525><U+0001F525>\nThe climax had the heart-beat racing like
never before. This is how Historic WAR films should be made - Bollywood's Best so far!
@OmRaut &amp; @AjayDevgn TAKE A BOW<U+0001F64F>\n \n#Tanhaji #TanhajiReview
## 6 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## source display_text_width reply_to_status_id reply_to_user_id
## 1 Twitter for Android 128 NA NA
## 2 Twitter for Android 140 NA NA
## 3 Twitter for Android 140 NA NA
## 4 Twitter for Android 140 NA NA
## 5 Twitter for Android 140 NA NA
## 6 Twitter for Android 140 NA NA
## reply_to_screen_name is_quote is_retweet favorite_count retweet_count
## 1 <NA> FALSE TRUE 0 17
## 2 <NA> FALSE TRUE 0 24
## 3 <NA> FALSE TRUE 0 10287
## 4 <NA> FALSE TRUE 0 948
## 5 <NA> FALSE TRUE 0 631
## 6 <NA> FALSE TRUE 0 10287
## quote_count reply_count hashtags symbols
## 1 NA NA Tanhaji NA
## 2 NA NA Tanhaji NA
## 3 NA NA c("OneWordReview", "Tanhaji", "Tanhaji") NA
## 4 NA NA TanhajiTheUnsungWarrior NA
## 5 NA NA <NA> NA
## 6 NA NA c("OneWordReview", "Tanhaji", "Tanhaji") NA
## urls_url urls_t.co urls_expanded_url media_url media_t.co media_expanded_url
## 1 <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA>
## media_type ext_media_url ext_media_t.co ext_media_expanded_url ext_media_type
## 1 <NA> <NA> <NA> <NA> NA
## 2 <NA> <NA> <NA> <NA> NA
## 3 <NA> <NA> <NA> <NA> NA
## 4 <NA> <NA> <NA> <NA> NA
## 5 <NA> <NA> <NA> <NA> NA
## 6 <NA> <NA> <NA> <NA> NA
## mentions_user_id mentions_screen_name lang quoted_status_id quoted_text
## 1 3086592157 OpinionsRP en NA <NA>
## 2 1337106955 ajay36mittal en NA <NA>
## 3 99642673 taran_adarsh en NA <NA>
## 4 146937987 SumitkadeI en NA <NA>
## 5 142231741 Nilzrav en NA <NA>
## 6 99642673 taran_adarsh en NA <NA>
## quoted_created_at quoted_source quoted_favorite_count quoted_retweet_count
## 1 <NA> <NA> NA NA
## 2 <NA> <NA> NA NA
## 3 <NA> <NA> NA NA
## 4 <NA> <NA> NA NA
## 5 <NA> <NA> NA NA
## 6 <NA> <NA> NA NA

## quoted_user_id quoted_screen_name quoted_name quoted_followers_count
## 1 NA <NA> <NA> NA
## 2 NA <NA> <NA> NA
## 3 NA <NA> <NA> NA
## 4 NA <NA> <NA> NA
## 5 NA <NA> <NA> NA
## 6 NA <NA> <NA> NA
## quoted_friends_count quoted_statuses_count quoted_location quoted_description
## 1 NA NA <NA> <NA>
## 2 NA NA <NA> <NA>
## 3 NA NA <NA> <NA>
## 4 NA NA <NA> <NA>
## 5 NA NA <NA> <NA>
## 6 NA NA <NA> <NA>
## quoted_verified retweet_status_id
## 1 NA 1.215320e+18
## 2 NA 1.215349e+18
## 3 NA 1.215295e+18
## 4 NA 1.215203e+18
## 5 NA 1.215344e+18
## 6 NA 1.215295e+18
##
retweet_text
## 1
Movie #Tanhaji is not holiday release so comparing its advance with holiday releases only
show ur jealous soul.
## 2
I can see a certain honesty in the early reviews of #Tanhaji which are coming out!! This
is so bloody rare in these times! And the kind of things people are saying,i am now so so
excited to catch it at the earliest! #TanhajiReview
## 3 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 4
#TanhajiTheUnsungWarrior Media Screening report is EPIC , my friend who is attending the
screening in mumbai saying its best film of @ajaydevgn career. #Tanhaji
## 5
WHAT A FILM YAAAR!<U+0001F525><U+0001F525>\nThe climax had the heart-beat racing like
never before. This is how Historic WAR films should be made - Bollywood's Best so far!
@OmRaut &amp; @AjayDevgn TAKE A BOW<U+0001F64F>\n \n#Tanhaji #TanhajiReview
## 6 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## retweet_created_at retweet_source retweet_favorite_count
## 1 2020-01-09 17:08:44 Twitter for Android 39
## 2 2020-01-09 19:05:16 Twitter for Android 44
## 3 2020-01-09 15:31:45 Twitter for iPad 49234
## 4 2020-01-09 09:24:04 Twitter for iPhone 3490
## 5 2020-01-09 18:47:04 Twitter for Android 2152
## 6 2020-01-09 15:31:45 Twitter for iPad 49234
## retweet_retweet_count retweet_user_id retweet_screen_name retweet_name
## 1 17 3086592157 OpinionsRP Ash
## 2 24 1337106955 ajay36mittal JabTakHaiCinema
## 3 10287 99642673 taran_adarsh taran adarsh
## 4 948 146937987 SumitkadeI Sumit kadel
## 5 631 142231741 Nilzrav N J

## 6 10287 99642673 taran_adarsh taran adarsh
## retweet_followers_count retweet_friends_count retweet_statuses_count
## 1 1526 193 16988
## 2 187 440 12324
## 3 3741490 168 34910
## 4 87103 88 20466
## 5 5055 959 56183
## 6 3741490 168 34910
## retweet_location
## 1 Seven Heaven
## 2
## 3 Mumbai, India
## 4 Kolkata, West Bengal.
## 5 India/ UAE
## 6 Mumbai, India
##
retweet_description
## 1
......
## 2 Engineer,MBA in Finance but defined by my love for
Movies,Acting,Singing,Dancing,Music,Cricket!\nExtremely Occasional Blog 'FILMALAYA' at
following link:
## 3 Movie
critic | Biz analyst | Influencer | Instagram: https://t.co/4hvdbhvfkX
## 4 Film Trade analyst | Critic | Influencer | Youtube channel -
https://t.co/CaHFAF2LD5 . For work related query email me at - Sumitkadel21@yahoo.com
## 5 Office Manager | Janta's Movie Reviewer | #Chhapaak Review: TODAY | All Things
Humor, Films & 90s Bollywood | Fun RTs | RATIONALIST | Gujju |Food & Freedom
## 6 Movie
critic | Biz analyst | Influencer | Instagram: https://t.co/4hvdbhvfkX
## retweet_verified place_url place_name place_full_name place_type country
## 1 FALSE <NA> <NA> <NA> <NA> <NA>
## 2 FALSE <NA> <NA> <NA> <NA> <NA>
## 3 TRUE <NA> <NA> <NA> <NA> <NA>
## 4 TRUE <NA> <NA> <NA> <NA> <NA>
## 5 FALSE <NA> <NA> <NA> <NA> <NA>
## 6 TRUE <NA> <NA> <NA> <NA> <NA>
## country_code geo_coords coords_coords bbox_coords
## 1 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 2 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 3 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 4 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 5 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 6 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## status_url
## 1 https://twitter.com/AJAY_sardar_/status/1215423099513888770
## 2 https://twitter.com/Vastha421/status/1215423225473064960
## 3 https://twitter.com/RAMUKUM01606330/status/1215423336399826944
## 4 https://twitter.com/PARVATHINP/status/1215423430809411585
## 5 https://twitter.com/Ankit_patel_AP/status/1215423488988598273
## 6 https://twitter.com/NanoMIndia/status/1215423485943541760
##
name
## 1
AJAY
## 2
Vastha42
## 3
RAMU KUMAR
## 4

PARVATHI P
## 5
Ankit<U+2694>
## 6 <U+092E><U+093F><U+091F><U+094D><U+091F><U+0940> <U+0915><U+093E>
<U+092E><U+093E><U+0927><U+094B>
## location
## 1
## 2
## 3 Kuwait
## 4
## 5 follows you
## 6 Navi Mumbai, India
##
description
## 1 bhakt of AJAY
DEVGN, MSDhoni nd MODI JI!!\n<U+0001F1EE><U+0001F1F3><U+0001F1EE><U+0001F1F3>
## 2

## 3 FROM..VIL..MUSAHARI..
POST...KAIL GHAR..PS..BARHARIYA..DST...SIWAN...BIHAR.. LIVE..IN.. KUWAIT.. CITY..
## 4
India is my country.. Bharath Mata ki Jai..
## 5
MovieLover ... \n\n\n\n@ajaydevgn\n\n\n\n... SportLover \n\n@msdhoni
## 6 Staunch follower of Sanatana Dharma,I support Hindutva. \nTotally believe in One
Nation-One Rule. let's unite against terror and everything which is NOT indian.
## url protected followers_count friends_count listed_count statuses_count
## 1 <NA> FALSE 56 349 0 2232
## 2 <NA> FALSE 16 143 0 1781
## 3 <NA> FALSE 18 101 0 273
## 4 <NA> FALSE 846 1329 5 33709
## 5 <NA> FALSE 216 306 0 14024
## 6 <NA> FALSE 2098 4198 1 42894
## favourites_count account_created_at verified profile_url
## 1 7087 2018-09-10 05:17:12 FALSE <NA>
## 2 2021 2020-01-06 06:05:54 FALSE <NA>
## 3 7099 2017-09-13 09:37:22 FALSE <NA>
## 4 27070 2009-11-08 14:28:26 FALSE <NA>
## 5 22156 2017-04-13 16:42:53 FALSE <NA>
## 6 44266 2017-11-20 07:44:18 FALSE <NA>
## profile_expanded_url account_lang
## 1 <NA> NA
## 2 <NA> NA
## 3 <NA> NA
## 4 <NA> NA
## 5 <NA> NA
## 6 <NA> NA
## profile_banner_url
## 1 https://pbs.twimg.com/profile_banners/1039019940592803841/1578548036
## 2 <NA>
## 3 https://pbs.twimg.com/profile_banners/907901003567063040/1552758484
## 4 https://pbs.twimg.com/profile_banners/88429125/1465792930
## 5 https://pbs.twimg.com/profile_banners/852562745870491648/1571653398
## 6 https://pbs.twimg.com/profile_banners/932514924160364544/1578215611
## profile_background_url
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 http://abs.twimg.com/images/themes/theme1/bg.png
## 5 <NA>

## 6 <NA>
## profile_image_url
## 1 http://pbs.twimg.com/profile_images/1163969231710371840/_dPMB5kq_normal.jpg
## 2 http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png
## 3 http://pbs.twimg.com/profile_images/1106975563229605888/GsW5lHOn_normal.jpg
## 4 http://pbs.twimg.com/profile_images/835309518246420480/lLeww3af_normal.jpg
## 5 http://pbs.twimg.com/profile_images/1215169883228401664/9nfFxm-W_normal.jpg
## 6 http://pbs.twimg.com/profile_images/1201882496511688705/-XcIJvSA_normal.jpg
tail(post_tweets_data)
## X.1 X user_id status_id created_at screen_name
## 105848 66582 142 1.214896e+18 1.216941e+18 2020-01-14 04:32:37 Chintan64138110
## 105849 66576 136 1.158960e+18 1.216941e+18 2020-01-14 04:32:41 Gajendr16824337
## 105850 66448 8 1.387810e+08 1.216941e+18 2020-01-14 04:32:43 onenonlymahakal
## 105851 66443 3 1.163468e+18 1.216941e+18 2020-01-14 04:32:44 shri0944
## 105852 66441 1 1.669707e+09 1.216941e+18 2020-01-14 04:32:45 pruthwirajdeo
## 105853 66442 2 7.057414e+17 1.216941e+18 2020-01-14 04:32:45 Aryaman111
##
text
## 105848 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105849 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105850 As per the reliable
sources #TanhajiTheUnsungWarrior will be declared national film of India by our home
minister @AmitShah in next parliament session.\n\nCongratulations @ajaydevgn @omraut
<U+0001F389><U+0001F389>#Tanhaji
## 105851 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105852 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105853 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## source display_text_width reply_to_status_id
## 105848 Twitter for Android 140 NA
## 105849 Twitter for Android 140 NA
## 105850 Twitter for Android 140 NA
## 105851 Twitter Web App 140 NA
## 105852 Twitter for Android 140 NA
## 105853 Twitter for Android 140 NA
## reply_to_user_id reply_to_screen_name is_quote is_retweet favorite_count
## 105848 NA <NA> FALSE TRUE 0
## 105849 NA <NA> FALSE TRUE 0
## 105850 NA <NA> FALSE TRUE 0
## 105851 NA <NA> FALSE TRUE 0
## 105852 NA <NA> FALSE TRUE 0
## 105853 NA <NA> FALSE TRUE 0
## retweet_count quote_count reply_count hashtags symbols
## 105848 25 NA NA Tanhaji NA
## 105849 25 NA NA Tanhaji NA

## 105850 14 NA NA TanhajiTheUnsungWarrior NA
## 105851 25 NA NA Tanhaji NA
## 105852 25 NA NA Tanhaji NA
## 105853 25 NA NA Tanhaji NA
## urls_url urls_t.co urls_expanded_url media_url media_t.co
## 105848 <NA> <NA> <NA> <NA> <NA>
## 105849 <NA> <NA> <NA> <NA> <NA>
## 105850 <NA> <NA> <NA> <NA> <NA>
## 105851 <NA> <NA> <NA> <NA> <NA>
## 105852 <NA> <NA> <NA> <NA> <NA>
## 105853 <NA> <NA> <NA> <NA> <NA>
## media_expanded_url media_type ext_media_url ext_media_t.co
## 105848 <NA> <NA> <NA> <NA>
## 105849 <NA> <NA> <NA> <NA>
## 105850 <NA> <NA> <NA> <NA>
## 105851 <NA> <NA> <NA> <NA>
## 105852 <NA> <NA> <NA> <NA>
## 105853 <NA> <NA> <NA> <NA>
## ext_media_expanded_url ext_media_type mentions_user_id
## 105848 <NA> NA 2924521080
## 105849 <NA> NA 2924521080
## 105850 <NA> NA 2754072768
## 105851 <NA> NA 2924521080
## 105852 <NA> NA 2924521080
## 105853 <NA> NA 2924521080
## mentions_screen_name lang quoted_status_id quoted_text quoted_created_at
## 105848 davidfrawleyved en NA <NA> <NA>
## 105849 davidfrawleyved en NA <NA> <NA>
## 105850 RoninADfannn en NA <NA> <NA>
## 105851 davidfrawleyved en NA <NA> <NA>
## 105852 davidfrawleyved en NA <NA> <NA>
## 105853 davidfrawleyved en NA <NA> <NA>
## quoted_source quoted_favorite_count quoted_retweet_count quoted_user_id
## 105848 <NA> NA NA NA
## 105849 <NA> NA NA NA
## 105850 <NA> NA NA NA
## 105851 <NA> NA NA NA
## 105852 <NA> NA NA NA
## 105853 <NA> NA NA NA
## quoted_screen_name quoted_name quoted_followers_count
## 105848 <NA> <NA> NA
## 105849 <NA> <NA> NA
## 105850 <NA> <NA> NA
## 105851 <NA> <NA> NA
## 105852 <NA> <NA> NA
## 105853 <NA> <NA> NA
## quoted_friends_count quoted_statuses_count quoted_location
## 105848 NA NA <NA>
## 105849 NA NA <NA>
## 105850 NA NA <NA>
## 105851 NA NA <NA>
## 105852 NA NA <NA>
## 105853 NA NA <NA>
## quoted_description quoted_verified retweet_status_id
## 105848 <NA> NA 1.216941e+18
## 105849 <NA> NA 1.216941e+18
## 105850 <NA> NA 1.216796e+18
## 105851 <NA> NA 1.216941e+18
## 105852 <NA> NA 1.216941e+18
## 105853 <NA> NA 1.216941e+18

##
retweet_text
## 105848 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105849 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105850 As per the reliable
sources #TanhajiTheUnsungWarrior will be declared national film of India by our home
minister @AmitShah in next parliament session.\n\nCongratulations @ajaydevgn @omraut
<U+0001F389><U+0001F389>#Tanhaji
## 105851 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105852 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105853 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## retweet_created_at retweet_source retweet_favorite_count
## 105848 2020-01-14 04:30:55 Twitter Web App 99
## 105849 2020-01-14 04:30:55 Twitter Web App 99
## 105850 2020-01-13 18:55:21 Twitter for Android 23
## 105851 2020-01-14 04:30:55 Twitter Web App 99
## 105852 2020-01-14 04:30:55 Twitter Web App 99
## 105853 2020-01-14 04:30:55 Twitter Web App 99
## retweet_retweet_count retweet_user_id retweet_screen_name
## 105848 25 2924521080 davidfrawleyved
## 105849 25 2924521080 davidfrawleyved
## 105850 14 2754072768 RoninADfannn
## 105851 25 2924521080 davidfrawleyved
## 105852 25 2924521080 davidfrawleyved
## 105853 25 2924521080 davidfrawleyved
## retweet_name
## 105848 Dr David Frawley
## 105849 Dr David Frawley
## 105850 THE WHITE WOLF..!! #Tanhaji <U+0001F6A9><U+0001F6A9>
## 105851 Dr David Frawley
## 105852 Dr David Frawley
## 105853 Dr David Frawley
## retweet_followers_count retweet_friends_count retweet_statuses_count
## 105848 307220 153 23992
## 105849 307220 153 23992
## 105850 1106 150 14186
## 105851 307220 153 23992
## 105852 307220 153 23992
## 105853 307220 153 23992
## retweet_location
## 105848 Santa Fe, NM USA
## 105849 Santa Fe, NM USA
## 105850
## 105851 Santa Fe, NM USA
## 105852 Santa Fe, NM USA

## 105853 Santa Fe, NM USA
##
retweet_description
## 105848 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 105849 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 105850

## 105851 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 105852 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 105853 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## retweet_verified place_url place_name place_full_name place_type country
## 105848 TRUE <NA> <NA> <NA> <NA> <NA>
## 105849 TRUE <NA> <NA> <NA> <NA> <NA>
## 105850 FALSE <NA> <NA> <NA> <NA> <NA>
## 105851 TRUE <NA> <NA> <NA> <NA> <NA>
## 105852 TRUE <NA> <NA> <NA> <NA> <NA>
## 105853 TRUE <NA> <NA> <NA> <NA> <NA>
## country_code geo_coords coords_coords bbox_coords
## 105848 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 105849 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 105850 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 105851 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 105852 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 105853 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## status_url
## 105848 https://twitter.com/Chintan64138110/status/1216941161698349057
## 105849 https://twitter.com/Gajendr16824337/status/1216941181264777217
## 105850 https://twitter.com/onenonlymahakal/status/1216941189032640512
## 105851 https://twitter.com/shri0944/status/1216941193944190976
## 105852 https://twitter.com/pruthwirajdeo/status/1216941197249282050
## 105853 https://twitter.com/Aryaman111/status/1216941195357753345
## name location
## 105848 Chintan Kumar
## 105849 Gajendra Singh Shekhawat
## 105850 Nirmal Aj fan <U+0001F1EE><U+0001F1F3> Indore, India
## 105851 Shri (<U+0936><U+094D><U+0930><U+0940>) Bengaluru, India
## 105852 pruthwiraj
## 105853 Vivek
##
description
## 105848
Rashtravaadi | name changed for security reason |
## 105849
Friendly
## 105850
movie n cricket maniac,shiv bhakt n believer
## 105851 Proud Hindian (Hindu+Indian) \n<U+0C9C><U+0CC8>
<U+0CB6><U+0CCD><U+0CB0><U+0CC0> <U+0CB0><U+0CBE><U+0CAE><U+0CCD>
<U+0001F6A9>\n<U+0CAD><U+0CBE><U+0CB0><U+0CA4><U+0CCD> <U+0CAE><U+0CBE><U+0CA4><U+0CBE>
<U+0C95><U+0CBF> <U+0C9C><U+0CC8>
<U+0001F1EE><U+0001F1F3>\n<U+0CB5><U+0C82><U+0CA6><U+0CC7>
<U+0CAE><U+0CBE><U+0CA4><U+0CB0><U+0C82> <U+0001F1EE><U+0001F1F3>
## 105852

## 105853

## url protected followers_count friends_count listed_count statuses_count


## 105848 <NA> FALSE 6 77 0 610
## 105849 <NA> FALSE 13 258 0 63
## 105850 <NA> FALSE 647 2073 9 83860
## 105851 <NA> FALSE 49 334 0 1147
## 105852 <NA> FALSE 39 208 1 801
## 105853 <NA> FALSE 285 1326 0 3480
## favourites_count account_created_at verified profile_url
## 105848 659 2020-01-08 13:06:03 FALSE <NA>
## 105849 2785 2019-08-07 04:35:56 FALSE <NA>
## 105850 15036 2010-04-30 15:30:39 FALSE <NA>
## 105851 1824 2019-08-19 15:10:40 FALSE <NA>
## 105852 1378 2013-08-14 06:24:51 FALSE <NA>
## 105853 16825 2016-03-04 13:07:19 FALSE <NA>
## profile_expanded_url account_lang
## 105848 <NA> NA
## 105849 <NA> NA
## 105850 <NA> NA
## 105851 <NA> NA
## 105852 <NA> NA
## 105853 <NA> NA
## profile_banner_url
## 105848 <NA>
## 105849 <NA>
## 105850 https://pbs.twimg.com/profile_banners/138781014/1495135789
## 105851 https://pbs.twimg.com/profile_banners/1163468236496617474/1578248233
## 105852 <NA>
## 105853 <NA>
## profile_background_url
## 105848 <NA>
## 105849 <NA>
## 105850 http://abs.twimg.com/images/themes/theme1/bg.png
## 105851 <NA>
## 105852 http://abs.twimg.com/images/themes/theme1/bg.png
## 105853 <NA>
##
profile_image_url
## 105848
http://pbs.twimg.com/profile_images/1214897021569470464/PsmQvyxI_normal.jpg
## 105849
http://pbs.twimg.com/profile_images/1158960946838044673/8WZueqx5_normal.jpg
## 105850
http://pbs.twimg.com/profile_images/688317640897638403/ELaY-ZEX_normal.jpg
## 105851
http://pbs.twimg.com/profile_images/1163468598603444224/Ia-OmyqY_normal.jpg
## 105852
http://pbs.twimg.com/profile_images/378800000293680962/f80a13a608555e74bc6c43f883e9eb03_n
ormal.jpeg
## 105853
http://pbs.twimg.com/profile_images/1207229736100884480/s4fAGDXh_normal.jpg

Retrieve original tweets

# Remove retweets
movie_tweets_original <- post_tweets_data[post_tweets_data$is_retweet==FALSE, ]
# Remove replies
movie_tweets_original <- subset(movie_tweets_original,
is.na(movie_tweets_original$reply_to_status_id))

Find the most popular tweet by favorite_count (i.e. the number of likes) or retweet_count (i.e. the number of retweets)

# favorite_count
movie_tweets_original <- movie_tweets_original %>% arrange(-favorite_count)
movie_tweets_original[1,5]
## [1] "2020-01-11 10:05:20"
#retweet_count
movie_tweets_original <- movie_tweets_original %>% arrange(-retweet_count)
movie_tweets_original[1,5]
## [1] "2020-01-11 10:05:20"

SHOW THE RATIO OF REPLIES/RETWEETS/ORIGINAL TWEETS

# dataset containing only the retweets and one containing only the replies.

# Keeping only the retweets


movie_retweets <- post_tweets_data[post_tweets_data$is_retweet==TRUE,]

# Keeping only the replies


movie_replies <- subset(post_tweets_data, !is.na(post_tweets_data$reply_to_status_id))

Create a separate data frame containing the number of original tweets, retweets, and replies

# Creating a data frame

original_count <- nrow(movie_tweets_original)


retweets_count <- nrow(movie_retweets)
replies_count <- nrow(movie_replies)

movie_data <- data.frame(
  category=c("Original", "Retweets", "Replies"),
  count=c(original_count, retweets_count, replies_count)
)
# Adding columns
movie_data$fraction = movie_data$count / sum(movie_data$count)
movie_data$percentage = movie_data$count / sum(movie_data$count) * 100
movie_data$ymax = cumsum(movie_data$fraction)
movie_data$ymin = c(0, head(movie_data$ymax, n=-1))

# Rounding the movie_data to two decimal points


movie_data <- round_df(movie_data, 2)

# Specify what the legend should say


Type_of_Tweet <- paste(movie_data$category, movie_data$percentage, "%")
ggplot(movie_data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=Type_of_Tweet)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(2, 4)) +
theme_void() +
theme(legend.position = "right")

SHOW THE SOURCE DEVICES FROM WHICH THE TWEETS ARE PUBLISHED

# (note: this summarises pre_tweets_data even though this is the post-release section;
#  post_tweets_data may have been intended)
tw_app <- pre_tweets_data %>%
  select(source) %>%
  group_by(source) %>%
  summarize(count=n())
tw_app <- subset(tw_app, count > 1000)

device_data <- data.frame(
  category=tw_app$source,
  count=tw_app$count
)
device_data$fraction = device_data$count / sum(device_data$count)
device_data$percentage = device_data$count / sum(device_data$count) * 100
device_data$ymax = cumsum(device_data$fraction)
device_data$ymin = c(0, head(device_data$ymax, n=-1))
device_data <- round_df(device_data, 2)
Source <- paste(device_data$category, device_data$percentage, "%")
ggplot(device_data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=Source)) +
  geom_rect() +
  coord_polar(theta="y") + # Try removing this layer to see how the chart is built initially
  xlim(c(2, 4)) +
  theme_void() +
  theme(legend.position = "right")

SHOW THE MOST FREQUENT WORDS FOUND IN THE TWEETS

#Cleaning the data


movie_tweets_original$text <- gsub("https\\S*", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("@\\S*", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("amp", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("[\r\n]", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("[[:punct:]]", "", movie_tweets_original$text)

# remove stop words from the text

tweets <- movie_tweets_original %>%
  select(text) %>%
  unnest_tokens(word, text)
tweets <- tweets %>%
  anti_join(stop_words)
## Joining, by = "word"

Plot the most frequent words found in the tweets

# Bar chart of the most frequent words found in the tweets
tweets %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(y = "Count",
       x = "Unique words",
       title = "Most frequent words found in the tweets of the movie",
       subtitle = "Stop words removed from the list")

## Selecting by n

SHOW THE MOST FREQUENTLY USED HASHTAGS

movie_tweets_original$hashtags <- as.character(movie_tweets_original$hashtags)


movie_tweets_original$hashtags <- gsub("c\\(", "", movie_tweets_original$hashtags)
set.seed(1234)
wordcloud(movie_tweets_original$hashtags, min.freq=50, scale=c(2, 1), random.order=FALSE,
rot.per=0.35, colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

SHOW THE ACCOUNTS FROM WHICH MOST RETWEETS ORIGINATE

set.seed(1234)
wordcloud(post_tweets_data$retweet_screen_name, min.freq=200, scale=c(2, .5),
random.order=FALSE, rot.per=0.25,
colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

SHOW THE LOCATIONS FROM WHICH MOST OF THE TWEETS ORIGINATE

set.seed(1234)

wordcloud(post_tweets_data$location, min.freq=400, scale=c(3, 1), random.order=FALSE,
          rot.per=0.25, colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

#wordcloud(tweets_data$retweet_location, min.freq=500,colors=brewer.pal(8, "Dark2"))

SHOW THE LOCATIONS FROM WHICH MOST OF THE RETWEETS ORIGINATE

set.seed(1234)

wordcloud(post_tweets_data$retweet_location, min.freq=200, scale=c(3, 1), random.order=FALSE,
          rot.per=0.25, colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

#wordcloud(tweets_data$retweet_location, min.freq=500,colors=brewer.pal(8, "Dark2"))

PERFORM A SENTIMENT ANALYSIS OF THE TWEETS ( “syuzhet” package )

# Converting tweets to ASCII to tackle strange characters


tweets <- iconv(tweets, from="UTF-8", to="ASCII", sub="")

# removing retweets, in case needed


tweets <-gsub("(RT|via)((?:\\b\\w*@\\w+)+)","",tweets)

# removing mentions, in case needed


tweets <-gsub("@\\w+","",tweets)

ew_sentiment<-get_nrc_sentiment((tweets))
sentimentscores<-data.frame(colSums(ew_sentiment[,]))
names(sentimentscores) <- "Score"
sentimentscores <- cbind("sentiment"=rownames(sentimentscores),sentimentscores)
rownames(sentimentscores) <- NULL
ggplot(data=sentimentscores,aes(x=sentiment,y=Score))+
geom_bar(aes(fill=sentiment),stat = "identity")+
theme(legend.position="none")+
xlab("Sentiments")+ylab("Scores")+
ggtitle("Total sentiment based on scores")+
theme_minimal()
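
For intuition about the scores that drive the rating buckets defined later, get_sentiment() returns one signed numeric score per string. A quick illustration with two made-up example strings (not taken from the dataset; exact magnitudes depend on the default syuzhet lexicon):

# Positive wording should give a positive score, negative wording a negative one.
get_sentiment(c("what a brilliant and inspiring film",
                "boring film and a waste of money"))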

CLEANING THE DATA

post_tweets_data$text = gsub("&amp", "", post_tweets_data$text)


post_tweets_data$text = gsub("&amp", "", post_tweets_data$text)
post_tweets_data$text = gsub("rt|RT", "", post_tweets_data$text) # remove Retweet
post_tweets_data$text = iconv(post_tweets_data$text, "latin1", "ASCII", sub="") # Remove
emojis/dodgy unicode
post_tweets_data$text = gsub("<(.*)>", "", post_tweets_data$text) # Remove pesky Unicodes
like <U+A>
post_tweets_data$text = gsub("https(.*)*$", "", post_tweets_data$text) # remove tweet URL
post_tweets_data$text = gsub("www[[:alnum:][:punct:]]*","",
tolower(post_tweets_data$text ))
post_tweets_data$text = gsub("<.*?>", "", post_tweets_data$text) # remove html tags
post_tweets_data$text = gsub("@\\w+", "", post_tweets_data$text) # remove at(@)
post_tweets_data$text = gsub("[[:punct:]]", "", post_tweets_data$text) # remove
punctuation
post_tweets_data$text = gsub("\r?\n|\r", " ", post_tweets_data$text) # remove /n
post_tweets_data$text = gsub("[[:digit:]]", " ", post_tweets_data$text) # remove
numbers/Digits
post_tweets_data$text = gsub("[ |\t]{2,}", " ", post_tweets_data$text) # remove tabs
post_tweets_data$text = gsub("^ ", "", post_tweets_data$text) # remove blank spaces at
the beginning
post_tweets_data$text = gsub(" $", "", post_tweets_data$text) # remove blank spaces at
the end

head(post_tweets_data$text)
## [1] "movie tanhaji is not holiday release so comparing its advance with holiday
releases only show ur jealous soul"

## [2] "i can see a ceain honesty in the early reviews of tanhaji which are coming out
this is so bloody rare in these times and the kind of things people are sayingi am now so
so excited to catch it at the earliest tanhajireview"

## [3] "onewordreview tanhaji superb rating cr film tanhajireview"

## [4] "tanhajitheunsungwarrior media screening repo is epic my friend who is attending


the screening in mumbai saying its best film of career tanhaji"

## [5] "what a film yaaar tanhaji tanhajireview"

## [6] "onewordreview tanhaji superb rating cr film tanhajireview"

Create a subset of the tweet texts

set.seed(777) # Make process reproducible

sub_blogs = post_tweets_data$text[sample(length(post_tweets_data$text), length(post_tweets_data$text)*0.1)] # take a 10% random subset

Creating a corpus and cleaning data

sub_blogs_Corpus <- VCorpus(VectorSource(sub_blogs))                               # Make corpus

sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, stripWhitespace)                      # Remove unnecessary white space
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removePunctuation)                    # Remove punctuation
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removeNumbers)                        # Remove numbers
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, tolower)                              # Convert to lowercase
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, PlainTextDocument)                    # Plain text
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removeWords, stopwords("english"))    # Remove English stop words

Tokenizing, calculating frequencies and making plots of n-grams

n_grams_plot <- function(n, data) {
  # Requires RWeka (NGramTokenizer, Weka_control) and slam (rollup), loaded earlier.
  options(mc.cores=1)

  # Build an n-gram tokenizer
  tk <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))

  # Create the term-document matrix of n-grams
  ngrams_matrix <- TermDocumentMatrix(data, control=list(tokenize=tk))
  # Collapse over documents for easy viewing
  ngrams_matrix <- as.matrix(rollup(ngrams_matrix, 2, na.rm=TRUE, FUN=sum))
  ngrams_matrix <- data.frame(word=rownames(ngrams_matrix), freq=ngrams_matrix[,1])
  # Keep the 20 most frequent n-grams
  ngrams_matrix <- ngrams_matrix[order(-ngrams_matrix$freq), ][1:20, ]
  ngrams_matrix$word <- factor(ngrams_matrix$word, as.character(ngrams_matrix$word))

  # Plot
  ggplot(ngrams_matrix, aes(x=word, y=freq)) +
    geom_bar(stat="Identity", fill="pink", colour="black") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("n-grams") + ylab("Frequency")
}

Plot of frequency distribution of 1-gram

n_grams_plot(n=1, data=sub_blogs_Corpus)

Plot of frequency distribution of 2-gram

n_grams_plot(n=2, data=sub_blogs_Corpus)

Plot of frequency distribution of 3-gram

n_grams_plot(n=3, data=sub_blogs_Corpus)

Plot of frequency distribution of 4-gram

n_grams_plot(n=4, data=sub_blogs_Corpus)

Create a corpus and derive the movie rating from the score given by the get_sentiment function

sent.value <- get_sentiment(post_tweets_data$text)

corpus_tw = Corpus(VectorSource(post_tweets_data$text))

corpus_tw = tm_map(corpus_tw, tolower)


## Warning in tm_map.SimpleCorpus(corpus_tw, tolower): transformation drops
## documents
corpus_tw = tm_map(corpus_tw, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus_tw, removePunctuation): transformation
## drops documents
corpus_tw = tm_map(corpus_tw, removeWords, c(stopwords("english")))
## Warning in tm_map.SimpleCorpus(corpus_tw, removeWords, c(stopwords("english"))):
## transformation drops documents
corpus_tw = tm_map(corpus_tw, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus_tw, stemDocument): transformation drops
## documents
frequencies_tw = DocumentTermMatrix(corpus_tw)

sparse_tw = removeSparseTerms(frequencies_tw, 0.995)

sparse_tw.df = as.data.frame(as.matrix(sparse_tw))

colnames(sparse_tw.df) = make.names(colnames(sparse_tw.df))
# An earlier labelling scheme, kept for reference:
# category_sentiment <- ifelse(sent.value < 0, "Bad",
#                       ifelse(sent.value >= 0 & sent.value < 0.5, "Average",
#                       ifelse(sent.value >= 0.5 & sent.value < 1, "Good",
#                       ifelse(sent.value >= 1 & sent.value < 1.5, "Very Good", "Excellent"))))
# sparse_tw.df$Polarity = category_sentiment

Classify the tweets into 5 categories based on the scores provided by the get_sentiment function

category_sentiment <- ifelse(sent.value < 0, 1,
                      ifelse(sent.value == 0, "Ignore",
                      ifelse(sent.value > 0 & sent.value < 0.5, 2,
                      ifelse(sent.value >= 0.5 & sent.value < 1, 3,
                      ifelse(sent.value >= 1 & sent.value < 1.5, 4, 5)))))

sparse_tw.df$Polarity = category_sentiment

table(sparse_tw.df$Polarity)
##
## 1 2 3 4 5 Ignore
## 8942 8203 17346 15614 24895 30853
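
The same five-way binning can be written more compactly with cut(); a sketch under the same thresholds (the helper name bin_polarity is ours, not part of the original pipeline):

bin_polarity <- function(score) {
  # [-Inf,0) -> 1, [0,0.5) -> 2, [0.5,1) -> 3, [1,1.5) -> 4, [1.5,Inf) -> 5
  out <- as.character(cut(score, breaks = c(-Inf, 0, 0.5, 1, 1.5, Inf),
                          labels = c("1", "2", "3", "4", "5"), right = FALSE))
  out[score == 0] <- "Ignore"   # scores of exactly zero are set aside, as above
  out
}
# table(bin_polarity(sent.value))   # should reproduce the counts shown above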

Remove the rows whose Polarity (Y) value is "Ignore"

sparse_tw_new.df <- filter(sparse_tw.df, Polarity != "Ignore")

table(sparse_tw_new.df$Polarity)

##
## 1 2 3 4 5
## 8942 8203 17346 15614 24895
ggplot(data=sparse_tw_new.df, aes(Polarity, fill=Polarity))+ geom_bar()

BUILD CLASSIFICATION MODELS AND PREDICT FOR POST RELEASE ANALYSIS

We will build different classification models and compare their accuracy and performance. Polarity
will be the dependent variable.
prop.table(table(sparse_tw_new.df$Polarity))
##
## 1 2 3 4 5
## 0.1192267 0.1093733 0.2312800 0.2081867 0.3319333
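
As a reference point for the accuracies reported below, the no-information rate (always predicting the most frequent class) follows directly from these proportions:

# Majority-class baseline accuracy, about 0.33 here; a useful model should beat this.
max(prop.table(table(sparse_tw_new.df$Polarity)))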

Extract a random sample of 5,000 observations from the DTM

#model_data <- sparse_tw_new.df %>% sample_frac(0.10)

model_data <- sample_n(sparse_tw_new.df, 5000)


dim(model_data)
## [1] 5000 461

Split the data into a 70:30 ratio for Train and Test

library(caTools)

set.seed(777)

model_data$Polarity <- as.factor(model_data$Polarity)

spl = sample.split(model_data$Polarity, SplitRatio = 0.7)

train_data = subset(model_data, spl == TRUE)
test_data = subset(model_data, spl == FALSE)

prop.table(table(train_data$Polarity))
##
## 1 2 3 4 5
## 0.1114286 0.1088571 0.2397143 0.2117143 0.3282857
prop.table(table(test_data$Polarity))
##
## 1 2 3 4 5
## 0.1113333 0.1086667 0.2400000 0.2120000 0.3280000

Build CART Model

# Load the Libraries


library(rpart)
library(rpart.plot)

movie_cart_model = rpart(Polarity ~ ., data=train_data, method="class")

#CART Diagram
prp(movie_cart_model, extra=2)

Predict and Evaluate the Performance of CART on train data

predict_cart_train_post = predict(movie_cart_model, data=train_data, type="class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_cart <- table(train_data$Polarity, predict_cart_train_post)
confusion_matrix_cart

## predict_cart_train_post
## 1 2 3 4 5
## 1 102 6 0 0 282
## 2 0 150 0 0 231
## 3 1 2 181 1 654
## 4 1 0 8 392 340
## 5 0 0 6 2 1141
# Baseline accuracy
accuracy_cart_train_post = sum(diag(confusion_matrix_cart))/sum(confusion_matrix_cart)
accuracy_cart_train_post
## [1] 0.5617143

Predict and Evaluate the Performance of CART on test data

predict_cart_test_post = predict(movie_cart_model, newdata=test_data, type="class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_cart <- table(test_data$Polarity, predict_cart_test_post)
confusion_matrix_cart
## predict_cart_test_post
## 1 2 3 4 5
## 1 33 5 0 2 127
## 2 0 68 0 0 95
## 3 2 0 72 0 286
## 4 0 2 4 170 142
## 5 0 0 4 3 485
# Baseline accuracy
accuracy_cart_test_post = sum(diag(confusion_matrix_cart))/sum(confusion_matrix_cart)
accuracy_cart_test_post
## [1] 0.552

AUC-ROC Curve for CART on Train and Test dataset

library(pROC)

#Train data - Plot ROC curve


roc_obj_cart_train_post <- multiclass.roc(as.numeric(train_data$Polarity), as.numeric(predict_cart_train_post), quiet = TRUE)
roc_obj_cart_train_post
##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_cart_train_post), quiet = TRUE)
##
## Data: as.numeric(predict_cart_train_post) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.6012
#Test data - Plot ROC curve
roc_obj_cart_test_post <- multiclass.roc(as.numeric(test_data$Polarity), as.numeric(predict_cart_test_post), quiet = TRUE)
roc_obj_cart_test_post
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =

as.numeric(predict_cart_test_post), quiet = TRUE)
##
## Data: as.numeric(predict_cart_test_post) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.5985

Comparison of all the performance measures of the CART Model on the Train and Test datasets

results_cart_train_post = data.frame(accuracy_cart_train_post,
as.numeric(roc_obj_cart_train_post$auc))
names(results_cart_train_post) = c("ACCURACY", "AUC-ROC" )

results_cart_test_post =
data.frame(accuracy_cart_test_post,as.numeric(roc_obj_cart_test_post$auc) )
names(results_cart_test_post) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_cart_train_post, results_cart_test_post)


row.names(df_fin) = c('CART Train Post', 'CART Test Post')
df_fin
## ACCURACY AUC-ROC
## CART Train Post 0.5617143 0.6011670
## CART Test Post 0.5520000 0.5984982

Build Random Forest Model

# Load Library
library(randomForest)

set.seed(777)

movie_rf_model = randomForest(Polarity ~ ., data=train_data,importance=TRUE)


movie_rf_model
##
## Call:
## randomForest(formula = Polarity ~ ., data = train_data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 21
##
## OOB estimate of error rate: 10.06%
## Confusion matrix:
## 1 2 3 4 5 class.error
## 1 302 4 60 4 20 0.22564103
## 2 13 311 37 6 14 0.18372703
## 3 11 6 782 9 31 0.06793802
## 4 5 2 39 654 41 0.11740891
## 5 10 1 26 13 1099 0.04351610

Predict and Evaluate the Performance of Random Forest on train data

# Make predictions:
# (predict.randomForest has no `data` argument; since `newdata` is not supplied, these are the
#  out-of-bag predictions on the training set, which is why the confusion matrix below matches
#  the OOB confusion matrix printed above)
predict_rf_train_post = predict(movie_rf_model, data=train_data,type="response")

# Evaluate the performance - Confusion matrix :


confusion_matrix_rf <- table(train_data$Polarity, predict_rf_train_post)
confusion_matrix_rf

## predict_rf_train_post
## 1 2 3 4 5
## 1 302 4 60 4 20
## 2 13 311 37 6 14
## 3 11 6 782 9 31
## 4 5 2 39 654 41
## 5 10 1 26 13 1099
# Baseline accuracy:
accuracy_rf_train_post = sum(diag(confusion_matrix_rf))/sum(confusion_matrix_rf)
accuracy_rf_train_post
## [1] 0.8994286

Predict and Evaluate the Performance of Random Forest on test data

# Make predictions:
predict_rf_test_post = predict(movie_rf_model, newdata=test_data,type="response")

# Evaluate the performance - Confusion matrix :


confusion_matrix_rf <- table(test_data$Polarity, predict_rf_test_post)
confusion_matrix_rf
## predict_rf_test_post
## 1 2 3 4 5
## 1 129 3 20 4 11
## 2 4 140 11 1 7
## 3 10 0 342 1 7
## 4 0 1 13 292 12
## 5 2 1 11 6 472
# Baseline accuracy:
accuracy_rf_test_post = sum(diag(confusion_matrix_rf))/sum(confusion_matrix_rf)
accuracy_rf_test_post
## [1] 0.9166667

Variable Importance of Random Forest

#Variable importance:
#varImpPlot(movie_rf_model,main='Variable Importance Plot: Movie ',type=2)
varImpPlot(movie_rf_model,sort=TRUE,type=NULL, class=NULL, scale=TRUE,cex=.8)
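
As a tabular companion to the plot, the scores behind varImpPlot() can be inspected directly; a short sketch using randomForest::importance() on the fitted model:

imp <- as.data.frame(importance(movie_rf_model))

# Ten strongest predictors by mean decrease in Gini impurity
head(imp[order(-imp$MeanDecreaseGini), "MeanDecreaseGini", drop = FALSE], 10)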

AUC-ROC Curve for Random Forest on Train and Test dataset

#Train data - Plot ROC curve


roc_obj_rf_train_post <- multiclass.roc(as.numeric(train_data$Polarity), as.numeric(predict_rf_train_post), quiet = TRUE)
roc_obj_rf_train_post
##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_rf_train_post), quiet = TRUE)
##
## Data: as.numeric(predict_rf_train_post) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.9174
#Test data - Plot ROC curve
# (note: the call below uses predict_cart_test_post, which looks like a copy-paste slip;
#  predict_rf_test_post was presumably intended, which is why the test AUC reported here,
#  0.5985, is identical to the CART test AUC rather than reflecting the Random Forest predictions)
roc_obj_rf_test_post <- multiclass.roc(as.numeric(test_data$Polarity), as.numeric(predict_cart_test_post), quiet = TRUE)
roc_obj_rf_test_post
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_cart_test_post), quiet = TRUE)
##
## Data: as.numeric(predict_cart_test_post) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.5985

Comparison of all the performance measures of the Random Forest Model on the Train and Test datasets

# (note: these two data frames reuse accuracy_rf_train_pre / accuracy_rf_test_pre from the
#  pre-release section, another likely copy-paste slip; accuracy_rf_train_post (0.8994) and
#  accuracy_rf_test_post (0.9167) computed above were presumably intended, which is why the
#  table below repeats the pre-release accuracies)
results_rf_train_post = data.frame(accuracy_rf_train_pre, as.numeric(roc_obj_rf_train_post$auc))
names(results_rf_train_post) = c("ACCURACY", "AUC-ROC")

results_rf_test_post = data.frame(accuracy_rf_test_pre, as.numeric(roc_obj_rf_test_post$auc))
names(results_rf_test_post) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_rf_train_post, results_rf_test_post)


row.names(df_fin) = c('Random Forest Train Post', 'Random Forest Test Post')
df_fin
## ACCURACY AUC-ROC
## Random Forest Train Post 0.9354286 0.9173600
## Random Forest Test Post 0.9326667 0.5984982
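
For reference, substituting the post-release accuracies computed above would give Random Forest Train Post an accuracy of about 0.90 (AUC 0.92) and Random Forest Test Post an accuracy of about 0.92; the 0.60 test AUC would also change once the Random Forest test predictions, rather than the CART ones, are used.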

Build SVM Model

set.seed(123)
library(e1071)

movie_svm_model = svm(Polarity ~ . , data = train_data)


## Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
## Variable(s) 'format' and 'backstab' and 'thackrey' and 'alluarjun' and 'boestim'
## and 'earlytrend' constant. Cannot scale data.

Predict and Evaluate the performance of SVM Model on train data

# Make predictions:
predict_svm_train_post = predict(movie_svm_model, data=train_data, decision.values=TRUE)

# Evaluate the performance - Confusion matrix :


confusion_matrix_svm <- table(train_data$Polarity, predict_svm_train_post)
confusion_matrix_svm
## predict_svm_train_post
## 1 2 3 4 5
## 1 120 0 209 2 59
## 2 0 185 117 3 76
## 3 0 0 763 1 75
## 4 0 0 189 401 151
## 5 0 0 82 7 1060
# Baseline accuracy:
accuracy_svm_train_post = sum(diag(confusion_matrix_svm))/sum(confusion_matrix_svm)
accuracy_svm_train_post
## [1] 0.7225714

Predict and Evaluate the performance of SVM Model on test data

# Make predictions:
predict_svm_test_post = predict(movie_svm_model, newdata=test_data,decision.values=TRUE)

# Evaluate the performance - Confusion matrix :


confusion_matrix_svm <- table(test_data$Polarity, predict_svm_test_post)
confusion_matrix_svm
## predict_svm_test_post
## 1 2 3 4 5
## 1 32 0 99 4 32
## 2 0 83 45 3 32
## 3 0 0 327 0 33
## 4 0 0 74 177 67
## 5 0 0 47 0 445

# Baseline accuracy:
accuracy_svm_test_post = sum(diag(confusion_matrix_svm))/sum(confusion_matrix_svm)
accuracy_svm_test_post
## [1] 0.7093333

AUC-ROC Curve for SVM model on Train and Test dataset

library(pROC)

#Train data - Plot ROC curve


roc_obj_svm_train_post <- multiclass.roc(as.numeric(train_data$Polarity), as.numeric(predict_svm_train_post), quiet = TRUE)
roc_obj_svm_train_post
##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_svm_train_post), quiet = TRUE)
##
## Data: as.numeric(predict_svm_train_post) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.7701
#Test data - Plot ROC curve
roc_obj_svm_test_post <- multiclass.roc(as.numeric(test_data$Polarity), as.numeric(predict_svm_test_post), quiet = TRUE)
roc_obj_svm_test_post
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_svm_test_post), quiet = TRUE)
##
## Data: as.numeric(predict_svm_test_post) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.7583

Comparison of all the performance measures of the SVM Model on the Train and Test datasets

results_svm_train_post = data.frame(accuracy_svm_train_post,
as.numeric(roc_obj_svm_train_post$auc))
names(results_svm_train_post) = c("ACCURACY", "AUC-ROC" )

results_svm_test_post =
data.frame(accuracy_svm_test_post,as.numeric(roc_obj_svm_test_post$auc) )
names(results_svm_test_post) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_svm_train_post, results_svm_test_post)


row.names(df_fin) = c('SVM Train Post', 'SVM Test Post')
df_fin
## ACCURACY AUC-ROC
## SVM Train Post 0.7225714 0.7700893
## SVM Test Post 0.7093333 0.7583435

Build Naive Bayes Model

set.seed(777)

movie_nb_model = naiveBayes(Polarity ~ . , usekernel=T, data = train_data)

Predict and Evaluate the performance of NB Model on train data


# Make predictions:
predict_nb_train_post = predict(movie_nb_model, train_data, type = "class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_nb <- table(train_data$Polarity, predict_nb_train_post)
confusion_matrix_nb
## predict_nb_train_post
## 1 2 3 4 5
## 1 267 119 0 4 0
## 2 39 341 0 1 0
## 3 148 550 117 24 0
## 4 131 461 0 149 0
## 5 612 293 13 22 209
# Baseline accuracy:
accuracy_nb_train_post = sum(diag(confusion_matrix_nb))/sum(confusion_matrix_nb)
accuracy_nb_train_post
## [1] 0.3094286

Predict and Evaluate the performance of NB Model on test data

# Make predictions:
predict_nb_test_post = predict(movie_nb_model,newdata = test_data, type = "class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_nb <- table(test_data$Polarity, predict_nb_test_post)
confusion_matrix_nb
## predict_nb_test_post
## 1 2 3 4 5
## 1 98 63 0 6 0
## 2 14 144 0 5 0
## 3 57 234 58 11 0
## 4 54 196 1 67 0
## 5 255 131 7 10 89
# Baseline accuracy:
accuracy_nb_test_post = sum(diag(confusion_matrix_nb))/sum(confusion_matrix_nb)
accuracy_nb_test_post
## [1] 0.304

AUC-ROC Curve for Naive Bayes model on Train and Test dataset

library(pROC)
#library(ROCR)
#Train data - Plot ROC curve
roc_obj_nb_train_post <- multiclass.roc(as.numeric(train_data$Polarity), as.numeric(predict_nb_train_post), quiet = TRUE)
roc_obj_nb_train_post
##
## Call:

## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_nb_train_post), quiet = TRUE)
##
## Data: as.numeric(predict_nb_train_post) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.6423
#Test data - Plot ROC curve
roc_obj_nb_test_post <- multiclass.roc(as.numeric(test_data$Polarity), as.numeric(predict_nb_test_post), quiet = TRUE)
roc_obj_nb_test_post
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_nb_test_post), quiet = TRUE)
##
## Data: as.numeric(predict_nb_test_post) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.6245

Comparison of all the performance measures of the Naive Bayes Model on the Train and Test datasets

results_nb_train_post = data.frame(accuracy_nb_train_post,
as.numeric(roc_obj_nb_train_post$auc))
names(results_nb_train_post) = c("ACCURACY", "AUC-ROC" )

results_nb_test_post =
data.frame(accuracy_nb_test_post,as.numeric(roc_obj_nb_test_post$auc) )
names(results_nb_test_post) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_nb_train_post, results_nb_test_post)


row.names(df_fin) = c('Naive Bayes Train Post', 'Naive Bayes Test Post')
df_fin
## ACCURACY AUC-ROC
## Naive Bayes Train Post 0.3094286 0.6422574
## Naive Bayes Test Post 0.3040000 0.6245372

Comparing all four models - CART, Random Forest, SVM and Naive Bayes - on the post-release data, by Accuracy and AUC-ROC

df_fin = rbind(results_cart_train_post, results_cart_test_post, results_rf_train_post, results_rf_test_post,
               results_svm_train_post, results_svm_test_post, results_nb_train_post, results_nb_test_post)

row.names(df_fin) = c('CART Train Post', 'CART Test Post', 'Random Forest Train Post', 'Random Forest Test Post',
                      'SVM Train Post', 'SVM Test Post', 'Naive Bayes Train Post', 'Naive Bayes Test Post')

#install.packages("kableExtra")
library(kableExtra)
print("Model Performance Comparison Metrics ")
## [1] "Model Performance Comparison Metrics "
kable(round(df_fin,2)) %>%
kable_styling(c("striped","bordered"))
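
For reference, the comparison table rendered by kable above contains the following values, rounded to two decimals (the Random Forest accuracy rows carry the pre-release values, as noted earlier):

                            ACCURACY   AUC-ROC
CART Train Post                 0.56      0.60
CART Test Post                  0.55      0.60
Random Forest Train Post        0.94      0.92
Random Forest Test Post         0.93      0.60
SVM Train Post                  0.72      0.77
SVM Test Post                   0.71      0.76
Naive Bayes Train Post          0.31      0.64
Naive Bayes Test Post           0.30      0.62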
