
An Interim Report on
Pre-Release & Post-Release Sentiment Analysis of Upcoming Movies

Submitted By
Group No. 8 Batch: APR- 2019 Location: Bengaluru

Group Members
Saurav Suman – BABAPR19053
Neha Tiwary – BABAPR19057
Divya Thomas – BABAPR19018
Anurag Kedia – BABAPR19011
Peehu – BABAPR19071

Research Supervisor
Mr. Deepak Sharma

Great Lakes Institute of Management

Contents

1. Introduction
2. Scope, Objective & Problem Statement
3. Data Source and Description
   a) Data Source
   b) Data Description
4. Data Pre-processing
   a) Data Cleaning
   b) Creation of Word Corpuses
   c) Extraction and Tokenization
   d) DTM (Document Term Matrix)
5. Exploratory Data Analysis
   a) Visualization of retweets
   b) Plot of most frequent words in the text
   c) Word Cloud of the keywords of the tweets
   d) Word Cloud for the account from which most retweets originate
   e) Word Cloud for the location from where most of the tweets originate
6. Modelling Approach
   a. Techniques and software to be used
   b. Model Building
   c. Comparison chart of accuracy & AUC values of all models for the pre-release dataset
   d. Comparison chart of accuracy & AUC values of all models for the post-release dataset
7. Recommendations & Applications
8. Challenges and Limitations
9. References and Bibliography
10. Appendix

1. Introduction:
Movies are one of the most convenient ways to entertain people, yet of the many movies the industry produces each year, only a few achieve high success and ratings. Movie revenue depends on several components, such as the cast acting in the movie, the production budget, critics' reviews, audience ratings and the release year. Because of these multiple components, there is no simple formula for predicting how much revenue a particular movie will generate.
However, by analysing the revenues generated by previous movies, a model can be built which can
help us predict the expected revenue for a particular movie.
In today's world, movies are one of the biggest sources of entertainment as well as a major business. If the success rate of a movie can be predicted accurately, stakeholders can earn higher profits from it, and if the prediction indicates a low success rate, they can improve the content of the movie to increase its revenue. Models and mechanisms that predict the success of a movie therefore help the business significantly: stakeholders such as actors, producers, directors and event companies can use these predictions to make more informed decisions before the movie is released.
Social media platforms such as Twitter, YouTube and Facebook are used daily by millions of people to share content and comments on all kinds of subjects. Businesses therefore have a strong interest in tapping into these huge data sources to extract information that can improve their decision-making.
For example, predictive models derived from social media data may help filmmakers make more profitable decisions. Movies are among the topics of greatest interest to social media users across all classes of people.

2. Scope, Objective & Problem Statement


The widespread usage of the internet has enabled people to share their views with the rest of the world online, and this way of broadcasting opinions has gained considerable popularity. However, it has also led to a decrease in the quality of the opinions shared, which makes it challenging for people to browse through all of them.
Opinions that require a text description, such as reviews of a movie, its acting, direction or songs, are much less prone to incorrect or invalid responses. Predicting people's sentiment towards a movie from tweets can help estimate the success ratio of the movie. The data is collected only from the well-known social media platform Twitter.
People are very active on Twitter and share their views from the moment a movie poster is released until the movie is running in theatres.
Movie reviews have been used for sentiment analysis before, and we expect tweets to express the same range of opinions and subjectivity as movie reviews.

Sentiment analysis aims to uncover the attitude of the person on a particular topic from the written
text. Other terms used to denote this research area include “opinion mining” and “subjectivity
detection”. It uses natural language processing and machine learning techniques to find statistical
and/or linguistic patterns in the text that reveal attitudes.
It has gained popularity in recent years due to its immediate applicability in the business environment, such as summarizing feedback from product reviews, discovering collaborative recommendations, or assisting in election campaigns. The focus of our project is the analysis of sentiment in short website comments.
We expect a short comment to express a person's opinion on a movie succinctly and directly. We focus on two important properties of text: 1. subjectivity – whether the style of the sentence is subjective or objective; 2. polarity – whether the person expresses a positive or negative opinion. We use statistical methods to capture the elements of subjective style and sentence polarity.
Statistical analysis is done at the sentence level, and we apply machine learning techniques to classify sets of messages. We are interested in the following questions:
1. To what extent can we extract the subjectivity and polarity from the short comments? What are
the important features that can be extracted from the raw text that have the greatest influence on the
classification?
2. What machine learning techniques are suitable for this purpose?

Problem Statement 1
 Based on the Twitter data, what are the sentiments of the people pre and post movie release?

Objective 1:
Identify the hashtags related to the movie that help gather more tweets about the movie to be analysed for sentiment.

Problem Statement 2
 Movie rating categorization based on the polarity of the tweets.

Objective 2:
Using the polarity score, categorize tweets into rating levels such as bad, average, good and excellent for any movie before its release, and predict the overall rating of the movie based on the Twitter data.

After the release of the movie, we collected a fresh set of tweets and rebuilt the model to check whether it gives the same result, which helps us conclude whether people's expectations before the movie release are similar to the sentiment analysis after the release.
 To meet our scope, we have built a model capable of predicting people's sentiment for multiple movies, which allows different movies to be compared on the basis of Twitter sentiment.

Page | 4
 Also, the model can predict people's sentiment from tweets supplied for different time frames (for example, tweets from 7 or 15 days pre and post release of the movie).

We analyse the data and classify the tweets into the following bins according to the score obtained for each tweet:

No. of stars    Rating       Score range

1               Poor         Score < 0
2               Average      0 <= Score < 0.5
3               Good         0.5 <= Score < 1
4               Very Good    1 <= Score < 1.5
5               Excellent    Score >= 1.5

Table 1
“Score” is the variable which we get from the sent.value parameter after passing the tweets.
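A minimal R sketch of this binning, assuming the scores come from syuzhet's get_sentiment() as in the appendix code; the data frame name tweets_scored and the example texts are illustrative:

library(syuzhet)

# Illustrative tweets; in the project the scores are computed for the cleaned tweet text
tweets_scored <- data.frame(
  text = c("#Tanhaji is excellent", "#Chhapaak fails miserably"),
  stringsAsFactors = FALSE
)
tweets_scored$score <- get_sentiment(tweets_scored$text)

# Bin the continuous score into the five rating levels of Table 1
# (right = FALSE gives intervals [-Inf, 0), [0, 0.5), [0.5, 1), [1, 1.5), [1.5, Inf))
tweets_scored$rating <- cut(tweets_scored$score,
                            breaks = c(-Inf, 0, 0.5, 1, 1.5, Inf),
                            labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
                            right  = FALSE)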

3. Data Source and Description


a) Data Source

Twitter API

Twitter is an innovative microblogging service launched in 2006, currently with more than 550 million users. User-created status messages are termed tweets by this service. The public timeline of the Twitter service displays tweets of all users worldwide and is an extensive source of real-time information.
The original concept behind microblogging was to provide personal status updates, but today tweets cover practically everything, ranging from current political affairs to personal experiences.
Movie reviews, travel experiences, current events etc. add to the list. Tweets (and microblogs in
general) are different from reviews in their basic structure. While reviews are characterized by
formal text patterns and are summarized thoughts of authors, tweets are more casual and restricted to
140 characters of text. Tweets offer companies an additional avenue to gather feedback.
Sentiment analysis of products, movie reviews and similar topics aids customers in decision-making before making a purchase or planning for a movie. Enterprises find this area useful for researching public opinion of their company and products, or for analysing customer satisfaction.
Organizations use this information to gather feedback about newly released products, which helps improve further design. Because the Twitter dataset is easily available and has wide reach, we are using only Twitter data as our main dataset.

Scraping of data from Twitter:
Twitter data can be accessed through the public API provided by Twitter. These APIs can be accessed only through authenticated requests, which must be signed with valid credentials. Twitter provides the authentication keys with which tweet extraction can be done. A few steps need to be followed to create the authentication keys:

1. Create an application on Twitter.

2. Manage the application.

3. Change the permissions to read and write.

4. Retrieve the authentication keys.

After finishing the entire process, we get the unique keys required for the collection of tweets from Twitter. The unique keys are:

• Consumer key

• Consumer Secret key

• Access token

• Access token secret


The tweets collected from Twitter carry information such as the tweet ID, user ID, date of the tweet, retweet counts, etc. For our analysis we use only the tweet text, the tweet date and the tweet ID. We connect the API to our app so that we can collect all the tweets related to our selected movie, along with the comments, controversies and news about that particular movie.

Figure 1: Twitter data from different Sources
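A minimal sketch of the collection step described above, using the rtweet package loaded in the appendix; the application name, key strings and hashtag are placeholders, and create_token()/search_tweets() reflect the rtweet interface available at the time of this project:

library(rtweet)

# Authenticate with the keys generated for the Twitter application (placeholders)
token <- create_token(
  app             = "movie_sentiment_app",
  consumer_key    = "CONSUMER_KEY",
  consumer_secret = "CONSUMER_SECRET",
  access_token    = "ACCESS_TOKEN",
  access_secret   = "ACCESS_SECRET"
)

# Pull English-language tweets for the movie hashtag, including retweets
tanhaji_tweets <- search_tweets("#Tanhaji", n = 18000, lang = "en",
                                include_rts = TRUE, token = token)

# Keep only the fields used in the analysis: the tweet text, its date and its ID
tweets_subset <- tanhaji_tweets[, c("status_id", "created_at", "text")]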

b) Data Description
We will be using data from 7 and 15 days pre and post movie release. A sample dataset for the movie "Tanhaji" is included in the appendix. We use the text column of the dataset for our sentiment analysis; other columns such as source, retweet_count, hashtags and created_at are used to get an overview of the dataset.

Algorithm for Sentiment Analysis:

 The first phase was data acquisition, where we chose Twitter as our data source.
 The second phase was data cleaning. After scraping the data, we cleaned it, mainly removing records with unavailable features.
 The third phase is data integration and transformation, where we classified some features and created a corpus of the text data.
 The fourth phase is sentiment analysis of the tweets. We use the get_sentiment function to obtain a score for each tweet and then build the DTM of the tweets, which also carries a column with the score value.
 In the fifth phase, the dataset is divided into two sections: the training dataset contains 70% and the testing dataset 30% of the total data.
 The sixth phase is result and analysis, where we run different classification models on the dataset and check the accuracy and AUC value; in general, the higher the accuracy, the better the result. A condensed R sketch of the scoring and splitting phases is shown below.
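A condensed R sketch of the fourth and fifth phases, assuming clean_text (the pre-processed tweets) and dtm (their document-term matrix) are already built as described in the next section:

library(syuzhet)

# Phase 4: score every cleaned tweet and attach the score to the DTM features
score      <- get_sentiment(clean_text)
model_data <- data.frame(as.matrix(dtm), score = score)

# Phase 5: 70/30 train-test split
set.seed(123)
train_idx <- sample(nrow(model_data), size = floor(0.7 * nrow(model_data)))
train <- model_data[train_idx, ]
test  <- model_data[-train_idx, ]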

Process Flow Diagram

Start → Data Scraping & Collection using Twitter API (Twitter data) → Data Processing & Feature Engineering → Data Visualization → Splitting the dataset into train and test datasets → Model Generation using a Machine Learning Algorithm → Predict the Attitude of the Reviewer → Result & Analysis → End

Figure 2: Process Flow diagram

4. Data Pre-processing
a) Data Cleaning
Tweets containing both positive and negative emoticons were not taken into account. The list of positive emoticons used for labeling the training set includes :), :-), : ), :D and =), while the list of negative emoticons consists of :(, :-( and : (. Inevitably this simplification results in partially correct or noisy labeling. The emoticons were stripped out of the training data so that the classifier learns from the other features that describe the tweets.
The tweets were manually labeled based on their sentiment, regardless of the presence of emoticons
in the tweets. As the Twitter community has created its own language to post messages, we explore
the unique properties of this language to better define the feature space.

The following tweet preprocessing options were tested:

 Removal of HTML tags and the “@” symbol; removing the hyperlinks often present in tweets also keeps the vocabulary size in check.
 Removal of tweet URLs.
 Removal of stray Unicode such as <U+A>.
 Removal of punctuation marks. The basic approach is to remove everything that is not a standard number or letter. It should be borne in mind that punctuation can sometimes be useful, as in web addresses, where it is part of the address itself; the removal of punctuation should therefore be tailored to the specific problem. In our case, we remove all punctuation.
 Removal of unhelpful terms. Many words are used frequently but are only meaningful within a sentence; these are called stop words, for example ‘the’, ‘is’, ‘at’ and ‘which’. It is unlikely that these words improve our ability to understand sentiment, so we remove them to reduce the size of the data.
 Conversion of all words to lowercase, so that the same word is not counted as different because of case.
 Removal of numbers or digits
 Removal of blank spaces both from the beginning and the end of the tweet.

In addition to Twitter-specific text preprocessing, other standard preprocessing steps were performed
to define the feature space for tweet feature vector construction. These include text tokenization,
removal of stopwords, stemming, N-gram construction (concatenating 1 to N stemmed words
appearing consecutively) and using minimum word frequency for feature space reduction.
The resulting terms were used as features in the construction of TF-IDF feature vectors representing the documents (tweets). TF-IDF stands for the term frequency-inverse document frequency weighting scheme, in which the weight reflects how important a word is to a document within a document collection.
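A minimal sketch of these cleaning steps in R, assuming the raw tweets are in a character vector tweet_text and using the tm package loaded in the appendix; the regular expressions are illustrative, not the project's exact patterns:

library(tm)

clean_text <- tweet_text
clean_text <- gsub("http\\S+|www\\.\\S+", "", clean_text)     # remove URLs / hyperlinks
clean_text <- gsub("<U\\+[0-9A-Fa-f]+>", "", clean_text)      # remove stray Unicode tags like <U+A>
clean_text <- gsub("@\\w+", "", clean_text)                   # remove the "@" mentions
clean_text <- tolower(clean_text)                             # convert to lowercase
clean_text <- removePunctuation(clean_text)                   # remove punctuation marks
clean_text <- removeNumbers(clean_text)                       # remove numbers and digits
clean_text <- removeWords(clean_text, stopwords("english"))   # remove stop words
clean_text <- stripWhitespace(trimws(clean_text))             # trim and collapse blank spaces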

b) Creation of Word Corpuses
A positive word corpus contains all possible positive words commonly used in tweets; a negative word corpus is created in the same way.
A word corpus contains a large number of words, since tweets are written by people around the world in their own style.
So we had to consider all possible words for each corpus, especially since the analysis covers Hollywood and Bollywood movies.

Tweet: #BoxOffice Report Day 3 Early Estimates: #DeepikaPadukone's #Chhapaak fails miserably, #AjayDevgn's #Tanhaji excellent
Positive words: “excellent”
Negative words: “fails”, “miserably”

Table 2: Example showing tweet and polarity words

When this tweet is brought in for analysis, the positive word corpus compares all the tokens, finds the word “excellent” and assigns a polarity count. The tweet then passes through the negative word corpus, which finds the words “fails” and “miserably” and assigns a polarity count. All other words are ignored by the word corpuses since they have neutral polarity.

c) Extraction and Tokenization


A number of tweets are collected for testing and stored in a file. They are extracted one by one from the file using an R program. A sentence in the file is treated as one tweet, using punctuation marks as the boundary: whenever a punctuation mark appears between two words, everything up to the word before the punctuation mark is taken as one sentence. Likewise, all the sentences are extracted.
Each word in a sentence is then extracted using the spaces between words as the boundary: whenever a space appears between letters, the letters before the blank space are treated as one word. Each extracted word is called a “token”, and the process of making tokens from sentences is called tokenization.
Extraction and tokenization are performed one sentence at a time: a sentence is extracted, tokenized and passed through the further steps, and only after all the steps are completed for that sentence is the next sentence extracted and processed.

d) DTM (Document Term Matrix)
A document-term matrix or term-document matrix is a mathematical matrix that describes the
frequency of terms that occur in a collection of documents. In a document-term matrix, rows
correspond to documents in the collection and columns correspond to terms. There are various
schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf.
A document-term matrix tracks the frequency of each term in each document. It starts with the bag-of-words representation of the documents, and for each document we track the number of times a term occurs. The raw term count is a common metric to use in a document-term matrix.
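A short sketch of building the matrix with tm, assuming the cleaned text from the previous section; the TF-IDF weighting shown is one of the schemes mentioned above:

library(tm)

corpus <- VCorpus(VectorSource(clean_text))

# Raw term counts (bag-of-words)
dtm_counts <- DocumentTermMatrix(corpus)

# TF-IDF weighted matrix, keeping only terms of length 3 or more
dtm_tfidf <- DocumentTermMatrix(corpus,
                                control = list(weighting = weightTfIdf,
                                               wordLengths = c(3, Inf)))

# Drop very sparse terms to keep the feature space manageable
dtm_small <- removeSparseTerms(dtm_counts, sparse = 0.99)
inspect(dtm_small[1:5, 1:5])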

5. Exploratory Data Analysis


Sentiment analysis is broadly classified into two types: feature- or aspect-based sentiment analysis and objectivity-based sentiment analysis. Tweets related to movie reviews come under the category of feature-based sentiment analysis.
Objectivity-based sentiment analysis explores the tweets related to emotions such as hate, miss, love, etc. In statistics, exploratory data analysis (EDA) is an approach to analysing data sets to summarize their main characteristics, often with visual methods.
A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis-testing task. EDA is precisely that, an approach rather than a fixed set of techniques: an attitude about how a data analysis should be carried out.
Different approaches, including machine learning (ML) techniques, sentiment lexicons and hybrid approaches, have proved useful for sentiment analysis on formal texts, but their effectiveness for extracting sentiment from microblogging data still has to be explored.
A careful investigation of tweets reveals that the 140-character limit restricts the vocabulary that conveys the sentiment, and the varied domains discussed impose hurdles for training. The frequency of misspellings and slang words in tweets (and microblogs in general) is much higher than in other language resources, which is another hurdle that needs to be overcome.
On the other hand, the tremendous volume of data available from microblogging websites across varied domains is unmatched by other data resources. Microblogging language is characterized by expressive punctuation which conveys a lot of sentiment; bold-lettered phrases, exclamations, question marks, quoted text, etc. leave scope for sentiment extraction.
The proposed work attempts a novel approach on Twitter data by aggregating an adapted polarity lexicon, learnt from reviews in the domains under consideration, with tweet-specific features and unigrams to build a classifier model using machine learning techniques.
Here are a few EDA techniques that describe the microblogging language clearly and can build a strong picture of the sentiment around any movie. These techniques also give a few important signals about how the movie is trending in the market.

NOTE: Here, all the outputs have been considered for the movie “Tanhaji”

a) Visualization of retweets
Here we present the nature of the tweets we collected. The donut chart below shows the proportion of each type of tweet.

Figure 3: Donut chart for type of tweets

b) Plot of most frequent words in the text


The plot below shows the words most frequently used in the collected tweets.

Figure 4: Bar plot for most occurring movie tweets

 N-gram plots: Bag of Words ignores the semantic context of the review and concentrates primarily on the frequency of each word. To overcome that, we also tried n-gram modelling, wherein we created unigrams, bigrams and a mixture of both. While creating unigrams is more or less similar to the bag-of-words approach, bigrams provided more contextual information on the review text. A sketch of the bigram construction is given after the tri-gram plot below.

Plot of the frequency distribution of uni-grams: these models assign probabilities to sequences of single words.

Figure 5: Uni-gram distribution

Plot of the frequency distribution of bi-grams: these models assign probabilities to sequences of two words.

Figure 6: Bi-gram distribution

Plot of the frequency distribution of tri-grams: these models assign probabilities to sequences of three words.

Figure 7: Tri-gram distribution
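A sketch of the bigram construction using the RWeka package loaded in the appendix, assuming the corpus built in section 4; the same pattern with min/max of 1 or 3 gives the unigram and trigram counts:

library(tm)
library(RWeka)

# Tokenizer that produces two-word sequences
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

dtm_bigram <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))

# Most frequent bigrams, the quantities plotted in Figure 6
bigram_freq <- sort(colSums(as.matrix(dtm_bigram)), decreasing = TRUE)
head(bigram_freq, 10)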

c) Word Cloud of the keywords of the tweets

A word cloud is a text mining method for finding the most frequently used words in a text. Here is the word cloud of the most frequent words in the tweets; a sketch of generating such a cloud follows Figure 8.

Figure 8: Word cloud for frequent words
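A brief sketch of producing such a cloud with the wordcloud package loaded in the appendix, using the term frequencies from the document-term matrix built earlier:

library(wordcloud)
library(RColorBrewer)

# Frequency of each term across all tweets
word_freq <- sort(colSums(as.matrix(dtm_counts)), decreasing = TRUE)

set.seed(123)
wordcloud(words = names(word_freq), freq = word_freq,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))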

d) Word Cloud for Account from which most retweets originate
This word cloud shows the accounts from which most of the retweets have been generated.

Figure 9: Word cloud for retweets accounts

e) Word Cloud for the Location from where most of the tweets originate 

Figure 10: Word cloud for tweets location

6. Modelling Approach
a. Techniques and software to be used

 Excel
 R/Python:

Sentiment prediction has been a major area of research in recent times and is a challenging task, especially in morphologically rich languages. The task requires us to classify a given sentence as either "Positive" or "Negative". In order to do this, we went ahead and tried out multiple deep learning-based methods.
In this project we apply data mining techniques and machine learning algorithms, using several feature extraction techniques from text mining, and assess their relevance to our problem.
We develop a methodology on the basis of the historical and current data available from the data source, i.e. Twitter, to understand any movie's viewer sentiment.
We could predict the rating of a given movie based only on its summary, but quite often we also have reviews for the movie along with the summary, which can be used to improve the prediction capability of our models. Using the sentiment of these reviews as a prior along with the summary aids the task at hand.
In order to generate sentiment priors, we used the sentiment classification described in the previous section. The actual task of predicting the rating is a regression problem whose output would be a single floating-point value between 0.0 and 5.0; to simplify the task, we convert it into a classification problem by rounding the true rating to the nearest integer, which gives us a five-class classification problem.
Following are the key variables used during our analysis
 Text (tweets)
 Re-tweet count
 Screen name
 Date of tweet/ re-tweet
 Favorite_count
 hashtags
 retweet_favorite_count
 verified

From the above list of variables, Text (tweets) and hashtags are the most important variables used for the analysis; however, the other fields also contribute to the final insight from the textual data.
Sentiment analysis refers to the use of natural language processing, text analysis and computational
linguistics to extract and identify subjective information in source materials.
We provide both pre-release and post-release tweets for the EDA and for generating polarity from the tweets. Here we use get_nrc_sentiment(tweets) to find the sentiment of all the tweets passed.
The tidytext get_sentiments function returns a tibble, which can be used to take a look at which words are counted as "positive" and "negative" sentiment.
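A short sketch of this scoring step with syuzhet, assuming the cleaned tweet text; the column sums are the counts behind a bar chart like Figure 11:

library(syuzhet)

# One row per tweet: eight NRC emotions plus positive/negative counts
nrc_scores <- get_nrc_sentiment(clean_text)

# Totals per emotion across all tweets
emotion_totals <- colSums(nrc_scores)

barplot(sort(emotion_totals, decreasing = TRUE),
        las = 2, main = "NRC sentiment of the collected tweets")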

Figure 11: Sentiment graph from tweets generated

b. Model Building
We passed both the pre-release and post-release datasets through the models to meet our objective of finding people's sentiment and comparing it before and after the movie release.
For this we have built a few classification models for predicting the different rating levels:

 CART Model

CART (Classification and Regression Trees) is a machine learning algorithm for classification and regression. The CART algorithm works by recursively partitioning the training dataset to obtain purer target classes, where every node in the tree corresponds to a specific set of records split by a test on a selected feature. A minimal sketch of fitting such a tree is given below.
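A minimal sketch using the rpart and caret packages as an assumed implementation (the report does not name the exact R functions); rating is the five-level factor from Table 1 and train/test come from the 70/30 split:

library(rpart)
library(rpart.plot)
library(caret)

# Grow a classification tree on the training DTM features
cart_model <- rpart(rating ~ ., data = train, method = "class",
                    control = rpart.control(cp = 0.001, minsplit = 20))

rpart.plot(cart_model)   # tree diagram, as in Figure 12

# Confusion matrix and accuracy on the held-out 30%
cart_pred <- predict(cart_model, newdata = test, type = "class")
confusionMatrix(cart_pred, test$rating)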

Here is the output of the CART tree, which represents the different sentiments of people for the movie "Tanhaji".

Figure 12: Decision Tree

The performance of each model is evaluated with the confusion matrix, a specific table layout that allows visualization of the performance of an algorithm.
The confusion matrix of the CART model on the pre-release test dataset gives an accuracy of 71.86%.

Figure 13: CART Confusion Matrix for pre-release matrix

The confusion matrix of the CART model on the post-release tweets gives an accuracy of 55.2%.

Figure 14: CART Confusion Matrix for post-release matrix

 Random Forest Model

The Random Forest classifier introduces two types of randomness, the first with respect to the data and the second with respect to the features. It uses the concepts of bagging and bootstrapping.

As a Random Forest is a combination of decision trees, it has several hyperparameters:
 Number of trees to construct for the forest
 Number of features to select at random at each split
 Depth of each tree

All these hyperparameters have to be set manually, which is time consuming and does not guarantee good results for the values chosen. Each hyperparameter has its own importance and influence on the output prediction. Two measures of importance are given for each variable in the random forest: the first is based on how much the accuracy decreases when the variable is excluded, and the second on the decrease in Gini impurity when the variable is chosen to split a node. A sketch of fitting such a forest is given below.
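A sketch with the randomForest package (again an assumption about the specific implementation), setting the hyperparameters listed above and requesting both importance measures:

library(randomForest)

set.seed(123)
rf_model <- randomForest(rating ~ ., data = train,
                         ntree = 500,        # number of trees in the forest
                         mtry  = 30,         # features sampled at each split (illustrative value)
                         nodesize = 5,       # indirectly limits the depth of each tree
                         importance = TRUE)  # compute both importance measures

# Mean decrease in accuracy and mean decrease in Gini, as in Figure 15
varImpPlot(rf_model)

rf_pred <- predict(rf_model, newdata = test)
caret::confusionMatrix(rf_pred, test$rating)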

Figure 15: Variable Importance Plot for pre-release tweets

The confusion matrix of the Random Forest model on the pre-release tweets gives an accuracy of 93.26%.

Figure 16: Random Forest Confusion Matrix for pre-release tweets

The confusion matrix of the Random Forest model on the post-release tweets gives an accuracy of 91.66%.

Figure 17: Random Forest Confusion Matrix for post-release tweets

 Naïve Bayes Classifier


In this model, all the tweets and labels are first passed to the classifier. In the next step, feature extraction is done, and the extracted features and tweets are passed to the Naïve Bayes classifier, which is then trained on this training data. The classifier dump file is then opened in write-back mode, the feature words are stored in it along with the classifier, and the file is closed. A minimal sketch is given below.
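A minimal sketch with the naiveBayes function from e1071 (an assumed implementation; the report describes the classifier generically):

library(e1071)

# Train on the labelled features; Laplace smoothing guards against unseen feature words
nb_model <- naiveBayes(rating ~ ., data = train, laplace = 1)

nb_pred <- predict(nb_model, newdata = test)
caret::confusionMatrix(nb_pred, test$rating)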
The confusion matrix of the Naïve Bayes model on the pre-release tweets gives an accuracy of 48.2%.

Figure 18: Naïve Bayes Model Confusion Matrix for pre-release tweets

The confusion matrix of the Naïve Bayes model on the post-release tweets gives an accuracy of 30.4%.

Figure 19: Naïve Bayes Model Confusion Matrix for post-release tweets

 Support Vector Machines


For SVM we use three labels: 0, 1 and 2, where 0 represents positive, 1 negative and 2 neutral. Each word in a tweet is represented as either 0 or 1: if it is a feature word it is represented as 1, otherwise 0, so we get a sequence of 0s and 1s. This feature vector and the class labels are given to an SVM classifier to classify tweets as positive, negative or neutral. A minimal sketch is given below.
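A sketch with the svm function from e1071 (an assumed implementation), where label is a factor with the three levels described above and train_svm/test_svm hold the 0/1 feature-word indicators:

library(e1071)

# Linear-kernel SVM on the binary feature vectors
svm_model <- svm(label ~ ., data = train_svm, kernel = "linear", cost = 1)

svm_pred <- predict(svm_model, newdata = test_svm)
caret::confusionMatrix(svm_pred, test_svm$label)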

The confusion matrix of the SVM model on the pre-release tweets gives an accuracy of 73.4%.

Figure 20: SVM Model Confusion Matrix for pre-release tweets

The confusion matrix of the SVM model on the post-release tweets gives an accuracy of 72.25%.

Figure 21: SVM Model Confusion Matrix for post-release tweets

c. Comparison chart of accuracy & AUC values of all models for the pre-release dataset

Table 3

d. Comparison chart of accuracy & AUC values of all models for the post-release dataset

Table 4

Note: This shows that the Random Forest model performs best.

7. Recommendations & Applications


 This model helps us to find out the sentiments of people towards any movie.
 Another application of this project would be to find a group of viewers with similar movie tastes
(likes or dislikes).
 The sentiment of reviews is a valuable source that would lead to more accurate rating predictions and would help people know how a movie is faring at the box office from a public point of view.
 Sentiment analysis can help to immediately identify situations such as the reaction of people to a teaser or trailer, and can give necessary insights to the PR & management teams.

8. Challenges and Limitations


 Various hashtags point to the topic, and it is hard to cover all the hashtags that would add value to our sentiment analysis task.
 Misleading tweets may influence our analysis.
 Locations of all the users are not available in the data.
 The model is based only on Twitter data.
 Sarcastic comments are another hurdle to overcome, as it is difficult to differentiate whether they are meant in a positive or a negative sense.
 As the project is based on Twitter data, scraping the data at regular intervals is important, since Twitter only provides data from the last 7 to 9 days for any string or hashtag.
 This model can also be used for old movies, on the condition that data has been scraped from Twitter for the particular movie whose analysis is required.

 Another task in sentiment analysis is subjectivity/objectivity identification, which focuses on classifying a given text (usually a sentence) into one of two classes, objective or subjective. As the subjectivity of words and phrases may depend on their context, and an objective document may contain subjective sentences (e.g. a news article quoting people's opinions), this problem can sometimes be more difficult than polarity classification.

9. References and Bibliography

 A study on feature selection & classification algorithms–


https://ieeexplore.ieee.org/document/7522583
 Deep learning for sentiment analysis–https://cs224d.stanford.edu/reports/PouransariHadi.pdf
 Movie reviews using Logistic Regression–https://itnext.io/machine-learning-sentiment-analysis-
of-movie-reviews-using-logisticregression-62e9622b4532
 Natural Language Processing SoSe 2016 –
https://hpi.de/fileadmin/user_upload/fachgebiete/plattner/teaching/NaturalLanguageProcessing/
NLP2016/NLP09_SentimentAnalysis.pdf
 Opinion Mining on Twitter Data of Movie Reviews using R–
https://pdfs.semanticscholar.org/aad9/3c7978ddc2378e781f62ea12fece439f6d7f.pdf
 Sentiment Analysis – Wikipedia – https://en.wikipedia.org/wiki/Sentiment_analysis
 Sentiment Analysis of Movie Review Using Text Mining–https://acadpubl.eu/hub/2018-119-
16/2/374.pdf
 Sentiment Analysis of Movie Reviews using Machine Learning Techniques–
https://www.researchgate.net/publication/321843804_Sentiment_Analysis_of_Movie_Reviews_
using_Machine_Learning_Techniques
 Sentiment Analysis of Movie Reviews–https://machinelearningmastery.com/prepare-movie-
review-data-sentiment-analysis/
 Tidy text mining - https://www.tidytextmining.com/tidytext.html

10. Appendix

Below are the attached files for the sample dataset and the data definition for the movie "Tanhaji".

tanhaji_all sample data_definition.csv


dataset.xlsx

The following shows the R code and its outputs.


R Notebook
#install.packages("twitteR")
#install.packages("RCurl")
#install.packages("httr")
#install.packages("syuzhet")
#install.packages("rtweet")
#install.packages("forestmangr")
#install.packages("tidytext")
#install.packages("slam")
library(twitteR)
library(rtweet)
##
## Attaching package: 'rtweet'
## The following object is masked from 'package:twitteR':
##
## lookup_statuses
library(RCurl)
library(httr)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:httr':
##
## content
library(wordcloud)
## Loading required package: RColorBrewer
library(syuzhet)
##
## Attaching package: 'syuzhet'
## The following object is masked from 'package:rtweet':
##
## get_tokens
library(dplyr)
##
## Attaching package: 'dplyr'

Page | 24
## The following objects are masked from 'package:twitteR':
##
## id, location
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forestmangr)
library(tidytext)
library(slam)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RWeka)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
#setwd("E:\Study\Capstone")

#tweet1 <- read.csv("tanhaji_set3.csv",stringsAsFactors = FALSE)


#tweet2 <- read.csv("tanhaji_set4.csv",stringsAsFactors = FALSE)

#tanhaji_all <- rbind(tweet1,tweet2)

#write.csv(tanhaji_all,"tanhaji_all.csv")
setwd("E:\\Study\\Capstone")

movie_tweets_data <- read.csv("tanhaji_all.csv", stringsAsFactors = FALSE)

Sort the dataset in ascending order by date

movie_tweets_data <- movie_tweets_data %>%


arrange(created_at)

head(movie_tweets_data)
## X.1 X user_id status_id created_at screen_name
## 1 66439 66439 2.340511e+09 1.212471e+18 2020-01-01 20:31:06 bjadams156
## 2 66440 66440 2.340511e+09 1.212471e+18 2020-01-01 20:31:06 bjadams156
## 3 60459 60459 4.021178e+08 1.212553e+18 2020-01-02 01:54:20 A_Jay_FanNepal
## 4 56533 56533 1.148293e+18 1.212556e+18 2020-01-02 02:07:49 MizarPradyum
## 5 55158 55158 1.126794e+18 1.212557e+18 2020-01-02 02:12:43 ABHI_ADholic04
## 6 56467 56467 1.148293e+18 1.212558e+18 2020-01-02 02:13:47 MizarPradyum
##
text

## 1 Asking this only to @ajaydevgn fans ..in KRK style !!\nHow many of you feel that
#RohitShetty is making full chuti*a of Devgn on the name of <U+0001F449> mere bade Hain
bhaiya re !!\n#TanhajiTheUnsungWarrior \n#Tanhaji #BhujThePrideOfIndia\nAjay fans ke
alawa koi aur yahan text Kiya to UMKB <U+0001F923>
## 2 Asking this only to @ajaydevgn fans ..in KRK style !!\nHow many of you feel that
#RohitShetty is making full chuti*a of Devgn on the name of <U+0001F449> mere bade Hain
bhaiya re !!\n#TanhajiTheUnsungWarrior \n#Tanhaji #BhujThePrideOfIndia\nAjay fans ke
alawa koi aur yahan text Kiya to UMKB <U+0001F923>
## 3
@AbhishekDudhai6 @NishantADHolic_ @ajaydevgn After #Tanhaji cannot wait to watch this
## 4
Dhokebaaj of the decade<U+0001F449> @deepikapadukone ... She betrayed the person who has
given her back to back 4 hits(LAK , Race2, Cocktail, Aarakshan) by clashing her movie
#Chhapaak with SAIF Sir's #Tanhaji
## 5
@krish_is_Devil @MizarPradyum Thanks bhai , watch it in 3D for better experience
#TanhajiTheUnsungWarrior #Tanhaji
## 6
40 k completed on BMS #TanhajiTheUnsungWarrior #Tanhaji
## source display_text_width reply_to_status_id reply_to_user_id
## 1 Twitter for iPhone 140 NA NA
## 2 Twitter for iPhone 140 NA NA
## 3 Twitter for Android 104 NA NA
## 4 Twitter for Android 140 NA NA
## 5 Twitter for Android 84 1.212438e+18 1.207668e+18
## 6 Twitter for Android 76 NA NA
## reply_to_screen_name is_quote is_retweet favorite_count retweet_count
## 1 <NA> FALSE TRUE 0 58
## 2 <NA> FALSE TRUE 0 58
## 3 <NA> FALSE TRUE 0 4
## 4 <NA> FALSE TRUE 0 32
## 5 krish_is_Devil FALSE FALSE 4 0
## 6 <NA> FALSE TRUE 0 2
## quote_count reply_count hashtags symbols
## 1 NA NA RohitShetty NA
## 2 NA NA RohitShetty NA
## 3 NA NA Tanhaji NA
## 4 NA NA <NA> NA
## 5 NA NA c("TanhajiTheUnsungWarrior", "Tanhaji") NA
## 6 NA NA c("TanhajiTheUnsungWarrior", "Tanhaji") NA
## urls_url urls_t.co urls_expanded_url media_url media_t.co media_expanded_url
## 1 <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA>
## media_type ext_media_url ext_media_t.co ext_media_expanded_url ext_media_type
## 1 <NA> <NA> <NA> <NA> NA
## 2 <NA> <NA> <NA> <NA> NA
## 3 <NA> <NA> <NA> <NA> NA
## 4 <NA> <NA> <NA> <NA> NA
## 5 <NA> <NA> <NA> <NA> NA
## 6 <NA> <NA> <NA> <NA> NA
## mentions_user_id
## 1 c("368435117", "65659343")
## 2 c("368435117", "65659343")
## 3 c("110915886", "1109687428778999808", "2955267019", "65659343")
## 4 c("3853108342", "101695592")
## 5 c("1207667977618890752", "1148293149174820864")

## 6 1126794046528077824
## mentions_screen_name lang
## 1 c("teamrb_", "ajaydevgn") en
## 2 c("teamrb_", "ajaydevgn") en
## 3 c("racquel_erika", "AbhishekDudhai6", "NishantADHolic_", "ajaydevgn") en
## 4 c("ClassySaifian", "deepikapadukone") en
## 5 c("krish_is_Devil", "MizarPradyum") en
## 6 ABHI_ADholic04 en
## quoted_status_id quoted_text quoted_created_at quoted_source
## 1 NA <NA> <NA> <NA>
## 2 NA <NA> <NA> <NA>
## 3 NA <NA> <NA> <NA>
## 4 NA <NA> <NA> <NA>
## 5 NA <NA> <NA> <NA>
## 6 NA <NA> <NA> <NA>
## quoted_favorite_count quoted_retweet_count quoted_user_id quoted_screen_name
## 1 NA NA NA <NA>
## 2 NA NA NA <NA>
## 3 NA NA NA <NA>
## 4 NA NA NA <NA>
## 5 NA NA NA <NA>
## 6 NA NA NA <NA>
## quoted_name quoted_followers_count quoted_friends_count quoted_statuses_count
## 1 <NA> NA NA NA
## 2 <NA> NA NA NA
## 3 <NA> NA NA NA
## 4 <NA> NA NA NA
## 5 <NA> NA NA NA
## 6 <NA> NA NA NA
## quoted_location quoted_description quoted_verified retweet_status_id
## 1 <NA> <NA> NA 1.212400e+18
## 2 <NA> <NA> NA 1.212400e+18
## 3 <NA> <NA> NA 1.212353e+18
## 4 <NA> <NA> NA 1.212400e+18
## 5 <NA> <NA> NA NA
## 6 <NA> <NA> NA 1.212430e+18
##
retweet_text
## 1 Asking this only to @ajaydevgn fans ..in KRK style !!\nHow many of you feel that
#RohitShetty is making full chuti*a of Devgn on the name of <U+0001F449> mere bade Hain
bhaiya re !!\n#TanhajiTheUnsungWarrior \n#Tanhaji #BhujThePrideOfIndia\nAjay fans ke
alawa koi aur yahan text Kiya to UMKB <U+0001F923>
## 2 Asking this only to @ajaydevgn fans ..in KRK style !!\nHow many of you feel that
#RohitShetty is making full chuti*a of Devgn on the name of <U+0001F449> mere bade Hain
bhaiya re !!\n#TanhajiTheUnsungWarrior \n#Tanhaji #BhujThePrideOfIndia\nAjay fans ke
alawa koi aur yahan text Kiya to UMKB <U+0001F923>
## 3
@AbhishekDudhai6 @NishantADHolic_ @ajaydevgn After #Tanhaji cannot wait to watch this
## 4
Dhokebaaj of the decade<U+0001F449> @deepikapadukone ... She betrayed the person who has
given her back to back 4 hits(LAK , Race2, Cocktail, Aarakshan) by clashing her movie
#Chhapaak with SAIF Sir's #Tanhaji
## 5
<NA>
## 6
40 k completed on BMS #TanhajiTheUnsungWarrior #Tanhaji
## retweet_created_at retweet_source retweet_favorite_count
## 1 2020-01-01 15:48:21 Twitter for Android 357
## 2 2020-01-01 15:48:21 Twitter for Android 357
## 3 2020-01-01 12:42:00 Twitter for Android 6

## 4 2020-01-01 15:48:40 Twitter for Android 51
## 5 <NA> <NA> NA
## 6 2020-01-01 17:45:27 Twitter for Android 11
## retweet_retweet_count retweet_user_id retweet_screen_name
## 1 58 3.684351e+08 teamrb_
## 2 58 3.684351e+08 teamrb_
## 3 4 1.109159e+08 racquel_erika
## 4 32 3.853108e+09 ClassySaifian
## 5 NA NA <NA>
## 6 2 1.126794e+18 ABHI_ADholic04
## retweet_name retweet_followers_count
## 1 REAL BOXOFFICE 16590
## 2 REAL BOXOFFICE 16590
## 3 forever aamirian 831
## 4 <U+2614><U+FE0F>CLASSY SAIFIAN<U+2614> 1770
## 5 <NA> NA
## 6 ABHI_Tanhaji04 570
## retweet_friends_count retweet_statuses_count retweet_location
## 1 1 5215 New Delhi, India
## 2 1 5215 New Delhi, India
## 3 72 11616 India
## 4 32 21547
## 5 NA NA <NA>
## 6 603 13246
##
retweet_description
## 1
Typing the truth below ..<U+0001F447><U+0001F64F>
## 2
Typing the truth below ..<U+0001F447><U+0001F64F>
## 3 only aamir khan rock my
world...i love u aamir khan for life..i know one day i will see aamir khan up close &
personal
## 4 ur Fav Actor mi8 b a bigger<U+2B50><U+FE0F>but definitely not a better INSAAN than
SAIF SIR..Sir is epitome of Class,Royalness& humbleness.(Fan Account of Megastar SAIF
Sir).
## 5
<NA>
## 6
Tanhaji on 10 jan , watch it in 3d
## retweet_verified place_url place_name place_full_name place_type country
## 1 FALSE <NA> <NA> <NA> <NA> <NA>
## 2 FALSE <NA> <NA> <NA> <NA> <NA>
## 3 FALSE <NA> <NA> <NA> <NA> <NA>
## 4 FALSE <NA> <NA> <NA> <NA> <NA>
## 5 NA <NA> <NA> <NA> <NA> <NA>
## 6 FALSE <NA> <NA> <NA> <NA> <NA>
## country_code geo_coords coords_coords bbox_coords
## 1 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 2 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 3 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 4 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 5 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 6 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## status_url name
## 1 https://twitter.com/bjadams156/status/1212471332127948801 Brandon Adams
## 2 https://twitter.com/bjadams156/status/1212471332127948801 Brandon Adams
## 3 https://twitter.com/A_Jay_FanNepal/status/1212552672995356672 ADFnepal
## 4 https://twitter.com/MizarPradyum/status/1212556070024953856 Ajay Devgn fan
## 5 https://twitter.com/ABHI_ADholic04/status/1212557301426589698 ABHI_Tanhaji04

## 6 https://twitter.com/MizarPradyum/status/1212557570491047936 Ajay Devgn fan
## location
## 1 Winston-Salem,NC
## 2 Winston-Salem,NC
## 3 kathmandu, Nepal
## 4 Indiana, USA
## 5
## 6 Indiana, USA
## description url
## 1 I have a twin sister.I graduated from Glenn High School in 2013. <NA>
## 2 I have a twin sister.I graduated from Glenn High School in 2013. <NA>
## 3 Movie Lovers - @ajayDevgn - #maidaan #Tanhaji #bhuj #RRR <NA>
## 4 welcome to ajay devgn kingdom <NA>
## 5 Tanhaji on 10 jan , watch it in 3d <NA>
## 6 welcome to ajay devgn kingdom <NA>
## protected followers_count friends_count listed_count statuses_count
## 1 FALSE 1171 5002 24 64508
## 2 FALSE 1171 5002 24 64508
## 3 FALSE 4957 554 28 73249
## 4 FALSE 180 461 0 23128
## 5 FALSE 570 603 1 13246
## 6 FALSE 180 461 0 23128
## favourites_count account_created_at verified profile_url
## 1 62563 2014-02-13 03:26:57 FALSE <NA>
## 2 62563 2014-02-13 03:26:57 FALSE <NA>
## 3 16149 2011-10-31 15:40:10 FALSE <NA>
## 4 30028 2019-07-08 18:09:56 FALSE <NA>
## 5 4616 2019-05-10 10:20:10 FALSE <NA>
## 6 30028 2019-07-08 18:09:56 FALSE <NA>
## profile_expanded_url account_lang
## 1 <NA> NA
## 2 <NA> NA
## 3 <NA> NA
## 4 <NA> NA
## 5 <NA> NA
## 6 <NA> NA
## profile_banner_url
## 1 <NA>
## 2 <NA>
## 3 https://pbs.twimg.com/profile_banners/402117759/1474271309
## 4 https://pbs.twimg.com/profile_banners/1148293149174820864/1575377431
## 5 https://pbs.twimg.com/profile_banners/1126794046528077824/1576308416
## 6 https://pbs.twimg.com/profile_banners/1148293149174820864/1575377431
## profile_background_url
## 1 http://abs.twimg.com/images/themes/theme1/bg.png
## 2 http://abs.twimg.com/images/themes/theme1/bg.png
## 3 http://abs.twimg.com/images/themes/theme1/bg.png
## 4 <NA>
## 5 <NA>
## 6 <NA>
## profile_image_url
## 1 http://pbs.twimg.com/profile_images/538421341054312451/1MyROBLw_normal.jpeg
## 2 http://pbs.twimg.com/profile_images/538421341054312451/1MyROBLw_normal.jpeg
## 3 http://pbs.twimg.com/profile_images/1111984479063531521/Vk48l5bv_normal.jpg
## 4 http://pbs.twimg.com/profile_images/1215181148839530496/NUdgSpAR_normal.jpg
## 5 http://pbs.twimg.com/profile_images/1208985936597409792/H6J4Ls-9_normal.jpg
## 6 http://pbs.twimg.com/profile_images/1215181148839530496/NUdgSpAR_normal.jpg
tail(movie_tweets_data)

## X.1 X user_id status_id created_at screen_name
## 174346 66582 142 1.214896e+18 1.216941e+18 2020-01-14 04:32:37 Chintan64138110
## 174347 66576 136 1.158960e+18 1.216941e+18 2020-01-14 04:32:41 Gajendr16824337
## 174348 66448 8 1.387810e+08 1.216941e+18 2020-01-14 04:32:43 onenonlymahakal
## 174349 66443 3 1.163468e+18 1.216941e+18 2020-01-14 04:32:44 shri0944
## 174350 66441 1 1.669707e+09 1.216941e+18 2020-01-14 04:32:45 pruthwirajdeo
## 174351 66442 2 7.057414e+17 1.216941e+18 2020-01-14 04:32:45 Aryaman111
##
text
## 174346 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174347 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174348 As per the reliable
sources #TanhajiTheUnsungWarrior will be declared national film of India by our home
minister @AmitShah in next parliament session.\n\nCongratulations @ajaydevgn @omraut
<U+0001F389><U+0001F389>#Tanhaji
## 174349 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174350 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174351 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## source display_text_width reply_to_status_id
## 174346 Twitter for Android 140 NA
## 174347 Twitter for Android 140 NA
## 174348 Twitter for Android 140 NA
## 174349 Twitter Web App 140 NA
## 174350 Twitter for Android 140 NA
## 174351 Twitter for Android 140 NA
## reply_to_user_id reply_to_screen_name is_quote is_retweet favorite_count
## 174346 NA <NA> FALSE TRUE 0
## 174347 NA <NA> FALSE TRUE 0
## 174348 NA <NA> FALSE TRUE 0
## 174349 NA <NA> FALSE TRUE 0
## 174350 NA <NA> FALSE TRUE 0
## 174351 NA <NA> FALSE TRUE 0
## retweet_count quote_count reply_count hashtags symbols
## 174346 25 NA NA Tanhaji NA
## 174347 25 NA NA Tanhaji NA
## 174348 14 NA NA TanhajiTheUnsungWarrior NA
## 174349 25 NA NA Tanhaji NA
## 174350 25 NA NA Tanhaji NA
## 174351 25 NA NA Tanhaji NA
## urls_url urls_t.co urls_expanded_url media_url media_t.co
## 174346 <NA> <NA> <NA> <NA> <NA>
## 174347 <NA> <NA> <NA> <NA> <NA>
## 174348 <NA> <NA> <NA> <NA> <NA>
## 174349 <NA> <NA> <NA> <NA> <NA>
## 174350 <NA> <NA> <NA> <NA> <NA>

## 174351 <NA> <NA> <NA> <NA> <NA>
## media_expanded_url media_type ext_media_url ext_media_t.co
## 174346 <NA> <NA> <NA> <NA>
## 174347 <NA> <NA> <NA> <NA>
## 174348 <NA> <NA> <NA> <NA>
## 174349 <NA> <NA> <NA> <NA>
## 174350 <NA> <NA> <NA> <NA>
## 174351 <NA> <NA> <NA> <NA>
## ext_media_expanded_url ext_media_type mentions_user_id
## 174346 <NA> NA 2924521080
## 174347 <NA> NA 2924521080
## 174348 <NA> NA 2754072768
## 174349 <NA> NA 2924521080
## 174350 <NA> NA 2924521080
## 174351 <NA> NA 2924521080
## mentions_screen_name lang quoted_status_id quoted_text quoted_created_at
## 174346 davidfrawleyved en NA <NA> <NA>
## 174347 davidfrawleyved en NA <NA> <NA>
## 174348 RoninADfannn en NA <NA> <NA>
## 174349 davidfrawleyved en NA <NA> <NA>
## 174350 davidfrawleyved en NA <NA> <NA>
## 174351 davidfrawleyved en NA <NA> <NA>
## quoted_source quoted_favorite_count quoted_retweet_count quoted_user_id
## 174346 <NA> NA NA NA
## 174347 <NA> NA NA NA
## 174348 <NA> NA NA NA
## 174349 <NA> NA NA NA
## 174350 <NA> NA NA NA
## 174351 <NA> NA NA NA
## quoted_screen_name quoted_name quoted_followers_count
## 174346 <NA> <NA> NA
## 174347 <NA> <NA> NA
## 174348 <NA> <NA> NA
## 174349 <NA> <NA> NA
## 174350 <NA> <NA> NA
## 174351 <NA> <NA> NA
## quoted_friends_count quoted_statuses_count quoted_location
## 174346 NA NA <NA>
## 174347 NA NA <NA>
## 174348 NA NA <NA>
## 174349 NA NA <NA>
## 174350 NA NA <NA>
## 174351 NA NA <NA>
## quoted_description quoted_verified retweet_status_id
## 174346 <NA> NA 1.216941e+18
## 174347 <NA> NA 1.216941e+18
## 174348 <NA> NA 1.216796e+18
## 174349 <NA> NA 1.216941e+18
## 174350 <NA> NA 1.216941e+18
## 174351 <NA> NA 1.216941e+18
##
retweet_text
## 174346 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174347 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.

## 174348 As per the reliable
sources #TanhajiTheUnsungWarrior will be declared national film of India by our home
minister @AmitShah in next parliament session.\n\nCongratulations @ajaydevgn @omraut
<U+0001F389><U+0001F389>#Tanhaji
## 174349 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174350 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 174351 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## retweet_created_at retweet_source retweet_favorite_count
## 174346 2020-01-14 04:30:55 Twitter Web App 99
## 174347 2020-01-14 04:30:55 Twitter Web App 99
## 174348 2020-01-13 18:55:21 Twitter for Android 23
## 174349 2020-01-14 04:30:55 Twitter Web App 99
## 174350 2020-01-14 04:30:55 Twitter Web App 99
## 174351 2020-01-14 04:30:55 Twitter Web App 99
## retweet_retweet_count retweet_user_id retweet_screen_name
## 174346 25 2924521080 davidfrawleyved
## 174347 25 2924521080 davidfrawleyved
## 174348 14 2754072768 RoninADfannn
## 174349 25 2924521080 davidfrawleyved
## 174350 25 2924521080 davidfrawleyved
## 174351 25 2924521080 davidfrawleyved
## retweet_name
## 174346 Dr David Frawley
## 174347 Dr David Frawley
## 174348 THE WHITE WOLF..!! #Tanhaji <U+0001F6A9><U+0001F6A9>
## 174349 Dr David Frawley
## 174350 Dr David Frawley
## 174351 Dr David Frawley
## retweet_followers_count retweet_friends_count retweet_statuses_count
## 174346 307220 153 23992
## 174347 307220 153 23992
## 174348 1106 150 14186
## 174349 307220 153 23992
## 174350 307220 153 23992
## 174351 307220 153 23992
## retweet_location
## 174346 Santa Fe, NM USA
## 174347 Santa Fe, NM USA
## 174348
## 174349 Santa Fe, NM USA
## 174350 Santa Fe, NM USA
## 174351 Santa Fe, NM USA
##
retweet_description
## 174346 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 174347 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 174348

## 174349 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda

and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 174350 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 174351 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## retweet_verified place_url place_name place_full_name place_type country
## 174346 TRUE <NA> <NA> <NA> <NA> <NA>
## 174347 TRUE <NA> <NA> <NA> <NA> <NA>
## 174348 FALSE <NA> <NA> <NA> <NA> <NA>
## 174349 TRUE <NA> <NA> <NA> <NA> <NA>
## 174350 TRUE <NA> <NA> <NA> <NA> <NA>
## 174351 TRUE <NA> <NA> <NA> <NA> <NA>
## country_code geo_coords coords_coords bbox_coords
## 174346 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 174347 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 174348 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 174349 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 174350 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 174351 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## status_url
## 174346 https://twitter.com/Chintan64138110/status/1216941161698349057
## 174347 https://twitter.com/Gajendr16824337/status/1216941181264777217
## 174348 https://twitter.com/onenonlymahakal/status/1216941189032640512
## 174349 https://twitter.com/shri0944/status/1216941193944190976
## 174350 https://twitter.com/pruthwirajdeo/status/1216941197249282050
## 174351 https://twitter.com/Aryaman111/status/1216941195357753345
## name location
## 174346 Chintan Kumar
## 174347 Gajendra Singh Shekhawat
## 174348 Nirmal Aj fan <U+0001F1EE><U+0001F1F3> Indore, India
## 174349 Shri (<U+0936><U+094D><U+0930><U+0940>) Bengaluru, India
## 174350 pruthwiraj
## 174351 Vivek
##
description
## 174346
Rashtravaadi | name changed for security reason |
## 174347
Friendly
## 174348
movie n cricket maniac,shiv bhakt n believer
## 174349 Proud Hindian (Hindu+Indian) \n<U+0C9C><U+0CC8>
<U+0CB6><U+0CCD><U+0CB0><U+0CC0> <U+0CB0><U+0CBE><U+0CAE><U+0CCD>
<U+0001F6A9>\n<U+0CAD><U+0CBE><U+0CB0><U+0CA4><U+0CCD> <U+0CAE><U+0CBE><U+0CA4><U+0CBE>
<U+0C95><U+0CBF> <U+0C9C><U+0CC8>
<U+0001F1EE><U+0001F1F3>\n<U+0CB5><U+0C82><U+0CA6><U+0CC7>
<U+0CAE><U+0CBE><U+0CA4><U+0CB0><U+0C82> <U+0001F1EE><U+0001F1F3>
## 174350

## 174351

## url protected followers_count friends_count listed_count statuses_count


## 174346 <NA> FALSE 6 77 0 610
## 174347 <NA> FALSE 13 258 0 63
## 174348 <NA> FALSE 647 2073 9 83860
## 174349 <NA> FALSE 49 334 0 1147
## 174350 <NA> FALSE 39 208 1 801
## 174351 <NA> FALSE 285 1326 0 3480
## favourites_count account_created_at verified profile_url

## 174346 659 2020-01-08 13:06:03 FALSE <NA>
## 174347 2785 2019-08-07 04:35:56 FALSE <NA>
## 174348 15036 2010-04-30 15:30:39 FALSE <NA>
## 174349 1824 2019-08-19 15:10:40 FALSE <NA>
## 174350 1378 2013-08-14 06:24:51 FALSE <NA>
## 174351 16825 2016-03-04 13:07:19 FALSE <NA>
## profile_expanded_url account_lang
## 174346 <NA> NA
## 174347 <NA> NA
## 174348 <NA> NA
## 174349 <NA> NA
## 174350 <NA> NA
## 174351 <NA> NA
## profile_banner_url
## 174346 <NA>
## 174347 <NA>
## 174348 https://pbs.twimg.com/profile_banners/138781014/1495135789
## 174349 https://pbs.twimg.com/profile_banners/1163468236496617474/1578248233
## 174350 <NA>
## 174351 <NA>
## profile_background_url
## 174346 <NA>
## 174347 <NA>
## 174348 http://abs.twimg.com/images/themes/theme1/bg.png
## 174349 <NA>
## 174350 http://abs.twimg.com/images/themes/theme1/bg.png
## 174351 <NA>
##
profile_image_url
## 174346
http://pbs.twimg.com/profile_images/1214897021569470464/PsmQvyxI_normal.jpg
## 174347
http://pbs.twimg.com/profile_images/1158960946838044673/8WZueqx5_normal.jpg
## 174348
http://pbs.twimg.com/profile_images/688317640897638403/ELaY-ZEX_normal.jpg
## 174349
http://pbs.twimg.com/profile_images/1163468598603444224/Ia-OmyqY_normal.jpg
## 174350
http://pbs.twimg.com/profile_images/378800000293680962/f80a13a608555e74bc6c43f883e9eb03_n
ormal.jpeg
## 174351
http://pbs.twimg.com/profile_images/1207229736100884480/s4fAGDXh_normal.jpg

Creating the variables for the pre- and post-release dates of the movie
# Edit the Release date of the movie in MM/DD/YYYY format

release_date <- as.Date("01/10/2020", format = "%m/%d/%Y")


release_date
## [1] "2020-01-10"
pre_last_week_date <- release_date - 7
pre_last_week_date
## [1] "2020-01-03"
post_first_week_date <- release_date + 7
post_first_week_date
## [1] "2020-01-17"

MOVIE PRE RELEASE ANALYSIS

Filter the pre-release data from the dataset

pre_tweets_data <- movie_tweets_data %>%
  filter(created_at >= pre_last_week_date & created_at < release_date)
head(pre_tweets_data)
## X.1 X user_id status_id created_at screen_name
## 1 54102 54102 1.054712e+18 1.212890e+18 2020-01-03 00:14:51 FanOfAjayDevgn1
## 2 54124 54124 1.054712e+18 1.212890e+18 2020-01-03 00:16:23 FanOfAjayDevgn1
## 3 63424 63424 1.184134e+18 1.212891e+18 2020-01-03 00:16:52 theeejay_muc
## 4 60620 60620 1.009151e+08 1.212892e+18 2020-01-03 00:21:51 Tanhaji_25Dec19
## 5 57625 57625 1.122538e+18 1.212897e+18 2020-01-03 00:40:44 Gopinat38606021
## 6 55184 55184 9.445703e+17 1.212900e+18 2020-01-03 00:56:17 AdiansNepal
##
text
## 1 With just days to the release
of #TanhajiTheUnsungWarriror, the makers are making the most of the time to promote the
movie. @ajaydevgn @TanhajiFilm @omraut @SharadK7 @itsKajolD #SaifAliKhan #Tanhaji
\nhttps://t.co/ewbaCNXKq2
## 2
8 days to go #Tanhaji https://t.co/qSsZ0HMA7R
## 3 Pan-Indian movie #Darbar overtakes Telugu biggie
#SarileruNekkevvaru and Hindi biggie #Tanhaji to become the most awaited movie! Let's hit
100K interests before the end of this weekend, folks! 200K before release!
<U+0001F60E><U+0001F918> https://t.co/1djPtbZBo5
## 4 #TANHAJI rated 15 by British censor board #BBFC\nRunning time 130m 30s\n#BBFCInsight
strong violence, bloody images\n#TanhajiTheUnsungWarrior a historical action drama set in
17th century in which a Maratha warrior embarks on a mission to recapture a hill fortress
taken by Mughal
## 5 Pan-Indian movie #Darbar overtakes Telugu biggie
#SarileruNekkevvaru and Hindi biggie #Tanhaji to become the most awaited movie! Let's hit
100K interests before the end of this weekend, folks! 200K before release!
<U+0001F60E><U+0001F918> https://t.co/1djPtbZBo5
## 6
Tanhaji in hindi belts and Darbar in South to have huge box office openings as per
interest shown in BMS both have 40.4K and more. Chhapak to start on a dull note
everything depends on WOM.#Tanhaji
## source display_text_width reply_to_status_id reply_to_user_id
## 1 Twitter for Android 139 NA NA
## 2 Twitter for Android 61 NA NA
## 3 Twitter for Android 140 NA NA
## 4 Twitter Web App 140 NA NA
## 5 Twitter for Android 140 NA NA
## 6 Twitter for Android 140 NA NA
## reply_to_screen_name is_quote is_retweet favorite_count retweet_count
## 1 <NA> FALSE TRUE 0 37
## 2 <NA> FALSE TRUE 0 41
## 3 <NA> FALSE TRUE 0 172
## 4 <NA> FALSE TRUE 0 5
## 5 <NA> FALSE TRUE 0 172
## 6 <NA> FALSE TRUE 0 2
## quote_count reply_count hashtags symbols
## 1 NA NA TanhajiTheUnsungWarriror NA
## 2 NA NA Tanhaji NA
## 3 NA NA c("Darbar", "SarileruNekkevvaru", "Tanhaji") NA
## 4 NA NA c("TANHAJI", "BBFC", "BBFCInsight") NA
## 5 NA NA c("Darbar", "SarileruNekkevvaru", "Tanhaji") NA
## 6 NA NA <NA> NA
## urls_url urls_t.co urls_expanded_url
## 1 <NA> <NA> <NA>

## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 <NA> <NA> <NA>
## 5 <NA> <NA> <NA>
## 6 <NA> <NA> <NA>
## media_url media_t.co
## 1 <NA> <NA>
## 2 http://pbs.twimg.com/media/ENSf8oHUUAI7mcr.jpg https://t.co/qSsZ0HMA7R
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> <NA>
## 6 <NA> <NA>
## media_expanded_url media_type
## 1 <NA> <NA>
## 2 https://twitter.com/Meena_Iyer/status/1212770073535864832/photo/1 photo
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> <NA>
## 6 <NA> <NA>
## ext_media_url ext_media_t.co
## 1 <NA> <NA>
## 2 http://pbs.twimg.com/media/ENSf8oHUUAI7mcr.jpg https://t.co/qSsZ0HMA7R
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> <NA>
## 6 <NA> <NA>
## ext_media_expanded_url
## 1 <NA>
## 2 https://twitter.com/Meena_Iyer/status/1212770073535864832/photo/1
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## ext_media_type mentions_user_id mentions_screen_name lang quoted_status_id
## 1 NA 2390513293 PuneTimesOnline en NA
## 2 NA 143098087 Meena_Iyer en NA
## 3 NA 497770148 RajiniFC en NA
## 4 NA 139639456 BreakingViews4u en NA
## 5 NA 497770148 RajiniFC en NA
## 6 NA 565560313 PradeepBastola en NA
## quoted_text quoted_created_at quoted_source quoted_favorite_count
## 1 <NA> <NA> <NA> NA
## 2 <NA> <NA> <NA> NA
## 3 <NA> <NA> <NA> NA
## 4 <NA> <NA> <NA> NA
## 5 <NA> <NA> <NA> NA
## 6 <NA> <NA> <NA> NA
## quoted_retweet_count quoted_user_id quoted_screen_name quoted_name
## 1 NA NA <NA> <NA>
## 2 NA NA <NA> <NA>
## 3 NA NA <NA> <NA>
## 4 NA NA <NA> <NA>
## 5 NA NA <NA> <NA>
## 6 NA NA <NA> <NA>
## quoted_followers_count quoted_friends_count quoted_statuses_count
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA

## 6 NA NA NA
## quoted_location quoted_description quoted_verified retweet_status_id
## 1 <NA> <NA> NA 1.212779e+18
## 2 <NA> <NA> NA 1.212770e+18
## 3 <NA> <NA> NA 1.212737e+18
## 4 <NA> <NA> NA 1.212821e+18
## 5 <NA> <NA> NA 1.212737e+18
## 6 <NA> <NA> NA 1.212581e+18
##
retweet_text
## 1 With just days to the release
of #TanhajiTheUnsungWarriror, the makers are making the most of the time to promote the
movie. @ajaydevgn @TanhajiFilm @omraut @SharadK7 @itsKajolD #SaifAliKhan #Tanhaji
\nhttps://t.co/ewbaCNXKq2
## 2
8 days to go #Tanhaji https://t.co/qSsZ0HMA7R
## 3 Pan-Indian movie #Darbar overtakes Telugu biggie
#SarileruNekkevvaru and Hindi biggie #Tanhaji to become the most awaited movie! Let's hit
100K interests before the end of this weekend, folks! 200K before release!
<U+0001F60E><U+0001F918> https://t.co/1djPtbZBo5
## 4 #TANHAJI rated 15 by British censor board #BBFC\nRunning time 130m 30s\n#BBFCInsight
strong violence, bloody images\n#TanhajiTheUnsungWarrior a historical action drama set in
17th century in which a Maratha warrior embarks on a mission to recapture a hill fortress
taken by Mughal
## 5 Pan-Indian movie #Darbar overtakes Telugu biggie
#SarileruNekkevvaru and Hindi biggie #Tanhaji to become the most awaited movie! Let's hit
100K interests before the end of this weekend, folks! 200K before release!
<U+0001F60E><U+0001F918> https://t.co/1djPtbZBo5
## 6
Tanhaji in hindi belts and Darbar in South to have huge box office openings as per
interest shown in BMS both have 40.4K and more. Chhapak to start on a dull note
everything depends on WOM.#Tanhaji
## retweet_created_at retweet_source retweet_favorite_count
## 1 2020-01-02 16:52:00 TweetDeck 137
## 2 2020-01-02 16:18:12 Twitter for iPhone 205
## 3 2020-01-02 14:06:41 Twitter Web App 410
## 4 2020-01-02 19:39:22 Twitter Web App 18
## 5 2020-01-02 14:06:41 Twitter Web App 410
## 6 2020-01-02 03:48:21 Twitter for Android 3
## retweet_retweet_count retweet_user_id retweet_screen_name
## 1 37 2390513293 PuneTimesOnline
## 2 41 143098087 Meena_Iyer
## 3 172 497770148 RajiniFC
## 4 5 139639456 BreakingViews4u
## 5 172 497770148 RajiniFC
## 6 2 565560313 PradeepBastola
## retweet_name retweet_followers_count retweet_friends_count
## 1 Pune Times 67861 1261
## 2 Meena Iyer 4414 103
## 3 Rajinikanth Fans <U+0001F918> 56964 275
## 4 Breaking Movies 23424 1709
## 5 Rajinikanth Fans <U+0001F918> 56964 275
## 6 Pradeep Bastola 143 51
## retweet_statuses_count retweet_location
## 1 16903 Pune, India
## 2 2243 India
## 3 29618
## 4 84747 FB.com/BreakMovies
## 5 29618
## 6 1458 Nepal

##
retweet_description
## 1 Official handle of Pune Times. Follow for news about the
city and updates from Bollywood and the Marathi entertainment industry
## 2 Influencer, CEO, Ajay Devgn FFilms, Ex-EDITOR Bombay Times and DNA
After Hours, Author Khullam Khulla. Retweets are not endorsements.
## 3 Rajini FC is a tribute to Superstar Rajinikanth handled by Team Rajinists. Updates
on Thalaivar Rajnikanth's next films and more. #Darbar #Thalaivar168
## 4
!!!!! Engine out, completely !!!!!
## 5 Rajini FC is a tribute to Superstar Rajinikanth handled by Team Rajinists. Updates
on Thalaivar Rajnikanth's next films and more. #Darbar #Thalaivar168
## 6 Tweets are personal, RTs are not
endorsement. I love my country, sports, movies and Medical Science.
## retweet_verified place_url place_name place_full_name place_type country
## 1 TRUE <NA> <NA> <NA> <NA> <NA>
## 2 TRUE <NA> <NA> <NA> <NA> <NA>
## 3 FALSE <NA> <NA> <NA> <NA> <NA>
## 4 FALSE <NA> <NA> <NA> <NA> <NA>
## 5 FALSE <NA> <NA> <NA> <NA> <NA>
## 6 FALSE <NA> <NA> <NA> <NA> <NA>
## country_code geo_coords coords_coords bbox_coords
## 1 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 2 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 3 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 4 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 5 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 6 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## status_url
## 1 https://twitter.com/FanOfAjayDevgn1/status/1212890025676816384
## 2 https://twitter.com/FanOfAjayDevgn1/status/1212890414467837953
## 3 https://twitter.com/theeejay_muc/status/1212890536119545858
## 4 https://twitter.com/Tanhaji_25Dec19/status/1212891789150961664
## 5 https://twitter.com/Gopinat38606021/status/1212896540500512768
## 6 https://twitter.com/AdiansNepal/status/1212900453937111040
## name location
## 1 <U+0001F31F>Fan Of Ajay Devgn<U+0001F31F>
## 2 <U+0001F31F>Fan Of Ajay Devgn<U+0001F31F>
## 3 Theeejay Germany
## 4 Sagar New Delhi, India
## 5 Gopinath
## 6 ADF NEPAL Kathmandu
##
description
## 1
TANHAJI \ni love Ajay sir
## 2
TANHAJI \ni love Ajay sir
## 3
Hi, here's is Theeejay! Main Twitter account: @theeejay
## 4

## 5

## 6 @ajaydevgn fan club Nepal. Die hard fan of the King of intensity & versatility, Two
Time national award winner & the real action hero. undisputed king of clash<U+0001F4AA>
## url protected followers_count friends_count listed_count statuses_count
## 1 <NA> FALSE 200 73 0 14911
## 2 <NA> FALSE 200 73 0 14911

## 3 <NA> FALSE 79 52 0 599
## 4 <NA> FALSE 1477 1091 11 26284
## 5 <NA> FALSE 354 713 2 30727
## 6 <NA> FALSE 101 265 0 3182
## favourites_count account_created_at verified profile_url
## 1 28093 2018-10-23 12:31:40 FALSE <NA>
## 2 28093 2018-10-23 12:31:40 FALSE <NA>
## 3 490 2019-10-15 15:47:25 FALSE <NA>
## 4 31826 2010-01-01 05:57:11 FALSE <NA>
## 5 38991 2019-04-28 16:27:36 FALSE <NA>
## 6 1282 2017-12-23 14:08:01 FALSE <NA>
## profile_expanded_url account_lang
## 1 <NA> NA
## 2 <NA> NA
## 3 <NA> NA
## 4 <NA> NA
## 5 <NA> NA
## 6 <NA> NA
## profile_banner_url
## 1 https://pbs.twimg.com/profile_banners/1054711956664434688/1555838130
## 2 https://pbs.twimg.com/profile_banners/1054711956664434688/1555838130
## 3 https://pbs.twimg.com/profile_banners/1184133591908921345/1576187331
## 4 https://pbs.twimg.com/profile_banners/100915145/1571653346
## 5 <NA>
## 6 https://pbs.twimg.com/profile_banners/944570289953824768/1514039933
## profile_background_url
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 http://abs.twimg.com/images/themes/theme1/bg.png
## 5 <NA>
## 6 http://abs.twimg.com/images/themes/theme1/bg.png
## profile_image_url
## 1 http://pbs.twimg.com/profile_images/1194577883181699073/S4UiPutd_normal.jpg
## 2 http://pbs.twimg.com/profile_images/1194577883181699073/S4UiPutd_normal.jpg
## 3 http://pbs.twimg.com/profile_images/1187238869311381504/AeBQsav4_normal.jpg
## 4 http://pbs.twimg.com/profile_images/1164493859386085376/cgxrrBl5_normal.png
## 5 http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png
## 6 http://pbs.twimg.com/profile_images/1212555422906769411/uw-yV2xx_normal.jpg
tail(pre_tweets_data)
## X.1 X user_id status_id created_at
## 67488 173184 106744 7.084906e+17 1.215422e+18 2020-01-09 23:56:04
## 67489 173181 106741 1.210608e+18 1.215422e+18 2020-01-09 23:56:21
## 67490 173168 106728 7.936731e+17 1.215422e+18 2020-01-09 23:56:58
## 67491 173167 106727 1.938105e+08 1.215422e+18 2020-01-09 23:57:10
## 67492 173166 106726 8.528880e+17 1.215423e+18 2020-01-09 23:58:32
## 67493 173165 106725 2.986419e+09 1.215423e+18 2020-01-09 23:59:27
## screen_name
## 67488 vk9378
## 67489 DeepakK98376858
## 67490 dev4Ind
## 67491 Ornawalla
## 67492 KrishnamitraHKJ
## 67493 vishalmellark
##
text
## 67488 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...

Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 67489
"Bharpoor Entertainer, Lajawab Film #Tanhaji".- Journalist #TanhajiReview\n\nBest Of
Luck &amp; Congratulations In Advance !!!\n@ajaydevgn and Fans !!!
#TanhajiTheUnsungWarrior https://t.co/0oV0N3TDpn
## 67490 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 67491 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 67492
Me:- 2morrow my a/c will be @verified &amp; we both will go to see #Tanhaji also
#Boycott_Chhapaak\n\nTwitter :-
<U+0001F447><U+0001F447><U+0001F937><U+0001F3FB><U+200D><U+2642><U+FE0F><U+0001F648><U+00
01F602> https://t.co/eQm6hUrwYm
## 67493
#Thalaiva the day 1 business of #Darbar &gt; #Tanhaji + #Chapaak more than double it is
actually if the worldwide fig <U+0001F60D> one sun .. one moon .. one star ..
superstar ..
## source display_text_width reply_to_status_id
## 67488 Twitter for Android 140 NA
## 67489 Twitter for Android 143 NA
## 67490 Twitter for iPhone 140 NA
## 67491 Twitter for iPhone 140 NA
## 67492 Twitter for Android 133 NA
## 67493 Twitter for iPhone 142 NA
## reply_to_user_id reply_to_screen_name is_quote is_retweet favorite_count
## 67488 NA <NA> FALSE TRUE 0
## 67489 NA <NA> FALSE TRUE 0
## 67490 NA <NA> FALSE TRUE 0
## 67491 NA <NA> FALSE TRUE 0
## 67492 NA <NA> FALSE TRUE 0
## 67493 NA <NA> FALSE TRUE 0
## retweet_count quote_count reply_count
## 67488 10287 NA NA
## 67489 144 NA NA
## 67490 10287 NA NA
## 67491 10287 NA NA
## 67492 38 NA NA
## 67493 2 NA NA
## hashtags symbols urls_url urls_t.co
## 67488 c("OneWordReview", "Tanhaji", "Tanhaji") NA <NA> <NA>
## 67489 c("Tanhaji", "TanhajiReview") NA <NA> <NA>
## 67490 c("OneWordReview", "Tanhaji", "Tanhaji") NA <NA> <NA>
## 67491 c("OneWordReview", "Tanhaji", "Tanhaji") NA <NA> <NA>
## 67492 c("Tanhaji", "Boycott_Chhapaak") NA <NA> <NA>
## 67493 c("Thalaiva", "Darbar", "Tanhaji", "Chapaak") NA <NA> <NA>
## urls_expanded_url media_url media_t.co media_expanded_url media_type
## 67488 <NA> <NA> <NA> <NA> <NA>
## 67489 <NA> <NA> <NA> <NA> <NA>
## 67490 <NA> <NA> <NA> <NA> <NA>
## 67491 <NA> <NA> <NA> <NA> <NA>
## 67492 <NA> <NA> <NA> <NA> <NA>
## 67493 <NA> <NA> <NA> <NA> <NA>

## ext_media_url ext_media_t.co ext_media_expanded_url ext_media_type
## 67488 <NA> <NA> <NA> NA
## 67489 <NA> <NA> <NA> NA
## 67490 <NA> <NA> <NA> NA
## 67491 <NA> <NA> <NA> NA
## 67492 <NA> <NA> <NA> NA
## 67493 <NA> <NA> <NA> NA
## mentions_user_id mentions_screen_name lang
## 67488 99642673 taran_adarsh en
## 67489 1610358128 iSKsCombat_ en
## 67490 99642673 taran_adarsh en
## 67491 99642673 taran_adarsh en
## 67492 c("1064198326990458880", "63796828") c("IamPurn", "verified") en
## 67493 1135988009453547520 Justano84979866 en
## quoted_status_id quoted_text quoted_created_at quoted_source
## 67488 NA <NA> <NA> <NA>
## 67489 NA <NA> <NA> <NA>
## 67490 NA <NA> <NA> <NA>
## 67491 NA <NA> <NA> <NA>
## 67492 NA <NA> <NA> <NA>
## 67493 NA <NA> <NA> <NA>
## quoted_favorite_count quoted_retweet_count quoted_user_id
## 67488 NA NA NA
## 67489 NA NA NA
## 67490 NA NA NA
## 67491 NA NA NA
## 67492 NA NA NA
## 67493 NA NA NA
## quoted_screen_name quoted_name quoted_followers_count
## 67488 <NA> <NA> NA
## 67489 <NA> <NA> NA
## 67490 <NA> <NA> NA
## 67491 <NA> <NA> NA
## 67492 <NA> <NA> NA
## 67493 <NA> <NA> NA
## quoted_friends_count quoted_statuses_count quoted_location
## 67488 NA NA <NA>
## 67489 NA NA <NA>
## 67490 NA NA <NA>
## 67491 NA NA <NA>
## 67492 NA NA <NA>
## 67493 NA NA <NA>
## quoted_description quoted_verified retweet_status_id
## 67488 <NA> NA 1.215295e+18
## 67489 <NA> NA 1.215298e+18
## 67490 <NA> NA 1.215295e+18
## 67491 <NA> NA 1.215295e+18
## 67492 <NA> NA 1.215317e+18
## 67493 <NA> NA 1.215396e+18
##
retweet_text
## 67488 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 67489
"Bharpoor Entertainer, Lajawab Film #Tanhaji".- Journalist #TanhajiReview\n\nBest Of
Luck &amp; Congratulations In Advance !!!\n@ajaydevgn and Fans !!!
#TanhajiTheUnsungWarrior https://t.co/0oV0N3TDpn

## 67490 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 67491 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 67492
Me:- 2morrow my a/c will be @verified &amp; we both will go to see #Tanhaji also
#Boycott_Chhapaak\n\nTwitter :-
<U+0001F447><U+0001F447><U+0001F937><U+0001F3FB><U+200D><U+2642><U+FE0F><U+0001F648><U+00
01F602> https://t.co/eQm6hUrwYm
## 67493
#Thalaiva the day 1 business of #Darbar &gt; #Tanhaji + #Chapaak more than double it is
actually if the worldwide fig <U+0001F60D> one sun .. one moon .. one star ..
superstar ..
## retweet_created_at retweet_source retweet_favorite_count
## 67488 2020-01-09 15:31:45 Twitter for iPad 49234
## 67489 2020-01-09 15:42:11 Twitter Web App 312
## 67490 2020-01-09 15:31:45 Twitter for iPad 49234
## 67491 2020-01-09 15:31:45 Twitter for iPad 49234
## 67492 2020-01-09 16:58:50 Twitter for Android 10
## 67493 2020-01-09 22:12:14 Twitter for iPhone 7
## retweet_retweet_count retweet_user_id retweet_screen_name
## 67488 10287 9.964267e+07 taran_adarsh
## 67489 144 1.610358e+09 iSKsCombat_
## 67490 10287 9.964267e+07 taran_adarsh
## 67491 10287 9.964267e+07 taran_adarsh
## 67492 38 1.064198e+18 IamPurn
## 67493 2 1.135988e+18 Justano84979866
##
retweet_name
## 67488
taran adarsh
## 67489
Sardar Singh
## 67490
taran adarsh
## 67491
taran adarsh
## 67492
Nipun<U+0001F1EE><U+0001F1F3><U+0001F441><U+FE0F><U+0001F443><U+0001F441><U+FE0F><U+0001F
6A9>
## 67493 <U+0930><U+093E><U+0927><U+0947> <U+092E><U+094B><U+0939><U+0928>
<U+0915><U+0947> <U+092B><U+093C><U+0948><U+0928>
## retweet_followers_count retweet_friends_count retweet_statuses_count
## 67488 3741490 168 34910
## 67489 1858 1342 5520
## 67490 3741490 168 34910
## 67491 3741490 168 34910
## 67492 8873 6668 17149
## 67493 3 2 265
##
retweet_location
## 67488
Mumbai, India
## 67489

Punjab, India
## 67490
Mumbai, India
## 67491
Mumbai, India
## 67492 <U+0938><U+093E><U+0930><U+0947> <U+091C><U+0939><U+0949> <U+0938><U+0947>
<U+0905><U+091A><U+094D><U+091B><U+093E><U+0001F449><U+092D><U+093E><U+0930><U+0924>
<U+092A><U+094D><U+092F><U+093E><U+0930><U+093E>
## 67493

##
retweet_description
## 67488
Movie critic | Biz analyst | Influencer | Instagram: https://t.co/4hvdbhvfkX
## 67489
If you even dream of beating me, you better wake up and apologise!!! .\nOnly God Of
Bollywood @BeingSalmanKhan Matters And Rules!!! \n#SalmanKhan Fan
## 67490
Movie critic | Biz analyst | Influencer | Instagram: https://t.co/4hvdbhvfkX
## 67491
Movie critic | Biz analyst | Influencer | Instagram: https://t.co/4hvdbhvfkX
## 67492
#<U+0932><U+0947><U+0916><U+0915><U+270D><U+FE0F>#<U+0930><U+093E><U+0937><U+094D><U+091F
><U+094D><U+0930><U+0935><U+093E><U+0926><U+0940>,#TaxPayer,TwitsIn<U+2764>
#LovesNature<U+0001F331>,flwdBy<U+0001F449>@SDPachauri
@bhavsarhardiik,@Real_anuj,@shwait_malik @caopmishra @zammit_marc<U+0001F60D> Sis-
@Saritasidh
## 67493
Here only for Salman Khan ..
## retweet_verified place_url place_name place_full_name place_type country
## 67488 TRUE <NA> <NA> <NA> <NA> <NA>
## 67489 FALSE <NA> <NA> <NA> <NA> <NA>
## 67490 TRUE <NA> <NA> <NA> <NA> <NA>
## 67491 TRUE <NA> <NA> <NA> <NA> <NA>
## 67492 FALSE <NA> <NA> <NA> <NA> <NA>
## 67493 FALSE <NA> <NA> <NA> <NA> <NA>
## country_code geo_coords coords_coords bbox_coords
## 67488 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 67489 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 67490 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 67491 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 67492 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 67493 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## status_url
## 67488 https://twitter.com/vk9378/status/1215422014736875521
## 67489 https://twitter.com/DeepakK98376858/status/1215422085570281472
## 67490 https://twitter.com/dev4Ind/status/1215422241980088320
## 67491 https://twitter.com/Ornawalla/status/1215422292391419904
## 67492 https://twitter.com/KrishnamitraHKJ/status/1215422634613071873
## 67493 https://twitter.com/vishalmellark/status/1215422867610861569
## name location
## 67488 Vikas Kumar <U+0001F1EE><U+0001F1F3> Agartala, India
## 67489 Deepak Kumar
## 67490 devp India
## 67491 P B India
## 67492 Krishnamitra Jauhar
## 67493 buttercakeluv
##
description
## 67488

## 67489
Student
## 67490
VandeMatram
## 67491
Liberals are bunch of Chutiyas. A Proud Hindu. Supporter of Truth. No Bullshit, Just
State the Facts. NaMo Fan. No Poverty. Respect <U+270A>.
## 67492 <U+0938><U+0943><U+0937><U+094D><U+091F><U+093F> <U+0939><U+0948>
<U+0939><U+0930><U+093F> <U+092E><U+0928><U+094D><U+0926><U+093F><U+0930>
<U+092E><U+0947><U+0930><U+093E>, <U+0927><U+094D><U+092F><U+093E><U+0928>
<U+0939><U+0948> <U+0938><U+091A><U+094D><U+091A><U+0940>
<U+092A><U+0942><U+091C><U+093E><U+0964>
<U+0938><U+092C><U+092E><U+0947><U+0902> <U+0915><U+0943><U+0937><U+094D><U+0923>
<U+0928><U+093F><U+0939><U+093E><U+0930><U+0942><U+0901>
'<U+091C><U+094C><U+0939><U+0930>',<U+092D><U+093E><U+0935> <U+0928>
<U+0930><U+093E><U+0916><U+0942><U+0901> <U+0926><U+0942><U+091C><U+093E><U+0964><U+0964>
\n<U+0001F449>My tweets are\nin<U+0001F449><U+2665><U+FE0F>Likes<U+0001F448>
## 67493
I apologize in advance. snapchat : vishalmellark | instagram : buttercakeluv
## url protected followers_count friends_count listed_count statuses_count
## 67488 <NA> FALSE 378 1358 10 89474
## 67489 <NA> FALSE 18 472 0 189
## 67490 <NA> FALSE 101 234 3 11762
## 67491 <NA> FALSE 37 105 0 13768
## 67492 <NA> FALSE 11013 513 23 214923
## 67493 <NA> FALSE 117 159 0 24693
## favourites_count account_created_at verified profile_url
## 67488 137120 2016-03-12 03:11:37 FALSE <NA>
## 67489 1015 2019-12-27 17:07:01 FALSE <NA>
## 67490 18285 2016-11-02 04:36:34 FALSE <NA>
## 67491 12866 2010-09-22 18:38:41 FALSE <NA>
## 67492 5150 2017-04-14 14:15:19 FALSE <NA>
## 67493 33882 2015-01-17 03:36:10 FALSE <NA>
## profile_expanded_url account_lang
## 67488 <NA> NA
## 67489 <NA> NA
## 67490 <NA> NA
## 67491 <NA> NA
## 67492 <NA> NA
## 67493 <NA> NA
## profile_banner_url
## 67488 https://pbs.twimg.com/profile_banners/708490600933367808/1457753503
## 67489 https://pbs.twimg.com/profile_banners/1210607948184997888/1577769005
## 67490 <NA>
## 67491 <NA>
## 67492 https://pbs.twimg.com/profile_banners/852887997448155137/1528385874
## 67493 https://pbs.twimg.com/profile_banners/2986419235/1576170519
## profile_background_url
## 67488 http://abs.twimg.com/images/themes/theme1/bg.png
## 67489 <NA>
## 67490 <NA>
## 67491 http://abs.twimg.com/images/themes/theme5/bg.gif
## 67492 <NA>
## 67493 http://abs.twimg.com/images/themes/theme1/bg.png
## profile_image_url
## 67488 http://pbs.twimg.com/profile_images/1100769543012605952/bpjVOp0C_normal.jpg
## 67489 http://pbs.twimg.com/profile_images/1211877019077640192/4c7pFtKC_normal.jpg
## 67490 http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png
## 67491 http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png

## 67492 http://pbs.twimg.com/profile_images/1004746850199527425/33n7gVGL_normal.jpg
## 67493 http://pbs.twimg.com/profile_images/1213529784237486080/RtX3Uiat_normal.jpg

Retrieve original tweets


# Remove retweets
movie_tweets_original <- pre_tweets_data[pre_tweets_data$is_retweet==FALSE, ]

# Remove replies
movie_tweets_original <- subset(movie_tweets_original,
is.na(movie_tweets_original$reply_to_status_id))

Find the most popular original tweets by favorite_count (i.e. the number of likes) and by retweet_count (i.e. the number of retweets)

# favorite_count
movie_tweets_original <- movie_tweets_original %>% arrange(-favorite_count)
movie_tweets_original[1,5]
## [1] "2020-01-08 08:31:48"
#retweet_count
movie_tweets_original <- movie_tweets_original %>% arrange(-retweet_count)
movie_tweets_original[1,5]
## [1] "2020-01-09 15:31:45"
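As an illustrative aside (not part of the original output above), the text of the most retweeted original tweet can also be inspected after the ordering applied in the previous step:

# Inspect the text of the top tweet after arranging by retweet_count
head(movie_tweets_original$text, 1)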

SHOW THE RATIO OF REPLIES/RETWEETS/ORIGINAL TWEETS

# dataset containing only the retweets and one containing only the replies.

# Keeping only the retweets


movie_retweets <- pre_tweets_data[pre_tweets_data$is_retweet==TRUE,]

# Keeping only the replies


movie_replies <- subset(pre_tweets_data, !is.na(pre_tweets_data$reply_to_status_id))

Create a separate data frame containing the number of original tweets, retweets, and replies

# Creating a data frame

original_count <- nrow(movie_tweets_original)


retweets_count <- nrow(movie_retweets)
replies_count <- nrow(movie_replies)

movie_data <- data.frame(


category=c("Original", "Retweets", "Replies"),
count=c(original_count, retweets_count, replies_count )
)

PLOTTING THE TYPES OF TWEETS (ORIGINAL, REPLIES, RETWEETS)

# Adding columns
movie_data$fraction = movie_data$count / sum(movie_data$count)
movie_data$percentage = movie_data$count / sum(movie_data$count) * 100
movie_data$ymax = cumsum(movie_data$fraction)
movie_data$ymin = c(0, head(movie_data$ymax, n=-1))

# Rounding the movie_data to two decimal points


movie_data <- round_df(movie_data, 2)

# Specify what the legend should say


Type_of_Tweet <- paste(movie_data$category, movie_data$percentage, "%")

ggplot(movie_data, aes(ymax=ymax, ymin=ymin,
xmax=4, xmin=3, fill=Type_of_Tweet)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(2, 4)) +
theme_void() +
theme(legend.position = "right")
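Note that round_df() used above is a small rounding utility defined earlier in this report. Purely for reference, a minimal sketch of such a helper (an assumed implementation, not the original definition) could look like this:

# Hypothetical helper: round every numeric column of a data frame to 'digits' places
round_df <- function(df, digits) {
  numeric_cols <- vapply(df, is.numeric, logical(1))   # identify numeric columns
  df[numeric_cols] <- round(df[numeric_cols], digits)  # round them in place
  df
}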

SHOW THE SOURCE DEVICES FROM WHICH THE TWEETS WERE PUBLISHED

tw_app <- pre_tweets_data %>%


select(source) %>%
group_by(source) %>%
summarize(count=n())
tw_app <- subset(tw_app, count > 1000)

device_data <- data.frame(


category=tw_app$source,
count=tw_app$count
)
device_data$fraction = device_data$count / sum(device_data$count)
device_data$percentage = device_data$count / sum(device_data$count) * 100
device_data$ymax = cumsum(device_data$fraction)
device_data$ymin = c(0, head(device_data$ymax, n=-1))
device_data <- round_df(device_data, 2)
Source <- paste(device_data$category, device_data$percentage, "%")
ggplot(device_data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=Source)) +
geom_rect() +
coord_polar(theta="y") + # Try to remove this to understand how the chart is built initially
xlim(c(2, 4)) +
theme_void() +
theme(legend.position = "right")

SHOW THE MOST FREQUENT WORDS FOUND IN THE TWEETS

#Cleaning the data


movie_tweets_original$text <- gsub("https\\S*", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("@\\S*", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("amp", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("[\r\n]", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("[[:punct:]]", "", movie_tweets_original$text)

# remove stop words from the text

tweets <- movie_tweets_original %>%


select(text) %>%
unnest_tokens(word, text)
tweets <- tweets %>%
anti_join(stop_words)
## Joining, by = "word"

PLOT THE MOST FREQUENT WORDS IN THE TWEETS

# gives a bar chart of the most frequent words found in the tweets
tweets %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(y = "Count",
x = "Unique words",
title = "Most frequent words found in the tweets of the movie",
subtitle = "Stop words removed from the list")
## Selecting by n

SHOW THE MOST FREQUENTLY USED HASHTAGS

movie_tweets_original$hashtags <- as.character(movie_tweets_original$hashtags)


movie_tweets_original$hashtags <- gsub("c\\(", "", movie_tweets_original$hashtags)
set.seed(1234)
wordcloud(movie_tweets_original$hashtags, min.freq=50, scale=c(2, 1), random.order=FALSE,
rot.per=0.35, colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

SHOW THE ACCOUNTS FROM WHICH MOST RETWEETS ORIGINATE

set.seed(1234)
wordcloud(pre_tweets_data$retweet_screen_name, min.freq=200, scale=c(2, .5),
random.order=FALSE, rot.per=0.25,
colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

SHOW THE LOCATIONS FROM WHICH MOST OF THE TWEETS ORIGINATE

set.seed(1234)

wordcloud(pre_tweets_data$location, min.freq=200, scale=c(3, 1), random.order=FALSE,


rot.per=0.25,
colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

#wordcloud(tweets_data$retweet_location, min.freq=500,colors=brewer.pal(8, "Dark2"))

SHOW THE LOCATIONS FROM WHICH MOST OF THE RETWEETS ORIGINATE

set.seed(1234)

wordcloud(pre_tweets_data$retweet_location, min.freq=200, scale=c(3, 1),


random.order=FALSE, rot.per=0.25,
colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

#wordcloud(tweets_data$retweet_location, min.freq=500,colors=brewer.pal(8, "Dark2"))

PERFORM A SENTIMENT ANALYSIS OF THE TWEETS ( “syuzhet” package )

# Converting tweets to ASCII to tackle strange characters


tweets <- iconv(tweets, from="UTF-8", to="ASCII", sub="")

# removing retweets, in case needed


tweets <-gsub("(RT|via)((?:\\b\\w*@\\w+)+)","",tweets)

# removing mentions, in case needed


tweets <-gsub("@\\w+","",tweets)

ew_sentiment<-get_nrc_sentiment((tweets))
sentimentscores<-data.frame(colSums(ew_sentiment[,]))
names(sentimentscores) <- "Score"
sentimentscores <- cbind("sentiment"=rownames(sentimentscores),sentimentscores)
rownames(sentimentscores) <- NULL
ggplot(data=sentimentscores,aes(x=sentiment,y=Score))+
geom_bar(aes(fill=sentiment),stat = "identity")+
theme(legend.position="none")+
xlab("Sentiments")+ylab("Scores")+
ggtitle("Total sentiment based on scores")+
theme_minimal()

CLEANING THE DATA

pre_tweets_data$text = gsub("&amp", "", pre_tweets_data$text)
pre_tweets_data$text = gsub("&amp", "", pre_tweets_data$text)
pre_tweets_data$text = gsub("rt|RT", "", pre_tweets_data$text) # remove Retweet
pre_tweets_data$text = iconv(pre_tweets_data$text, "latin1", "ASCII", sub="") # Remove emojis/dodgy unicode
pre_tweets_data$text = gsub("<(.*)>", "", pre_tweets_data$text) # Remove pesky Unicodes like <U+A>
pre_tweets_data$text = gsub("https(.*)*$", "", pre_tweets_data$text) # remove tweet URL
pre_tweets_data$text = gsub("www[[:alnum:][:punct:]]*", "", tolower(pre_tweets_data$text)) # remove www links and convert to lowercase
pre_tweets_data$text = gsub("<.*?>", "", pre_tweets_data$text) # remove html tags
pre_tweets_data$text = gsub("@\\w+", "", pre_tweets_data$text) # remove at(@)
pre_tweets_data$text = gsub("[[:punct:]]", "", pre_tweets_data$text) # remove punctuation
pre_tweets_data$text = gsub("\r?\n|\r", " ", pre_tweets_data$text) # remove \n
pre_tweets_data$text = gsub("[[:digit:]]", " ", pre_tweets_data$text) # remove numbers/Digits
pre_tweets_data$text = gsub("[ |\t]{2,}", " ", pre_tweets_data$text) # remove tabs
pre_tweets_data$text = gsub("^ ", "", pre_tweets_data$text) # remove blank spaces at the beginning
pre_tweets_data$text = gsub(" $", "", pre_tweets_data$text) # remove blank spaces at the end

head(pre_tweets_data$text)
## [1] "with just days to the release of tanhajitheunsungwarriror the makers are making
the most of the time to promote the movie saifalikhan tanhaji"

## [2] "days to go tanhaji"

## [3] "panindian movie darbar oveakes telugu biggie sarilerunekkevvaru and hindi biggie
tanhaji to become the most awaited movie lets hit k interests before the end of this
weekend folks k before release"

## [4] "tanhaji rated by british censor board bbfc running time m s bbfcinsight strong
violence bloody images tanhajitheunsungwarrior a historical action drama set in th
century in which a maratha warrior embarks on a mission to recapture a hill foress taken
by mughal"
## [5] "panindian movie darbar oveakes telugu biggie sarilerunekkevvaru and hindi biggie
tanhaji to become the most awaited movie lets hit k interests before the end of this
weekend folks k before release"

## [6] "tanhaji in hindi belts and darbar in south to have huge box office openings as
per interest shown in bms both have k and more chhapak to sta on a dull note everything
depends on womtanhaji"

Create a subset of the tweet text

set.seed(777) # Make process reproducible


sub_blogs = pre_tweets_data$text[sample(length(pre_tweets_data$text),
                                        length(pre_tweets_data$text) * 0.1)] # make subset

Creating a corpus and cleaning data

sub_blogs_Corpus <- VCorpus(VectorSource(sub_blogs)) # Make corpus

sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, stripWhitespace) # Remove unnecessary white spaces
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removePunctuation) # Remove punctuation
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removeNumbers) # Remove numbers
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, tolower) # Convert to lowercase
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, PlainTextDocument) # Plain text
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removeWords, stopwords("english")) # Remove English stop words

Tokenizing, calculating frequencies and making plots of n-grams

n_grams_plot <- function(n, data) {
options(mc.cores=1)

# Builds n-gram tokenizer


tk <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
# Create matrix
ngrams_matrix <- TermDocumentMatrix(data, control=list(tokenize=tk))
# make matrix for easy view
ngrams_matrix <- as.matrix(rollup(ngrams_matrix, 2, na.rm=TRUE, FUN=sum))
ngrams_matrix <- data.frame(word=rownames(ngrams_matrix), freq=ngrams_matrix[,1])
# find 20 most frequent n-grams in the matrix
ngrams_matrix <- ngrams_matrix[order(-ngrams_matrix$freq), ][1:20, ]
ngrams_matrix$word <- factor(ngrams_matrix$word, as.character(ngrams_matrix$word))

# plots
ggplot(ngrams_matrix, aes(x=word, y=freq)) +
geom_bar(stat="Identity", fill="pink", colour="black") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("n-grams") +
ylab("Frequency")
}

Plot of frequency distribution of 1-gram

n_grams_plot(n=1, data=sub_blogs_Corpus)

Plot of frequency distribution of 2-gram

n_grams_plot(n=2, data=sub_blogs_Corpus)

Plot of frequency distribution of 3-gram

n_grams_plot(n=3, data=sub_blogs_Corpus)

Plot of frequency distribution of 4-gram

n_grams_plot(n=4, data=sub_blogs_Corpus)

Create the corpus and derive the rating of the movie based on the scores given by the syuzhet package

sent.value <- get_sentiment(pre_tweets_data$text)

corpus_tw = Corpus(VectorSource(pre_tweets_data$text))

corpus_tw = tm_map(corpus_tw, tolower)


## Warning in tm_map.SimpleCorpus(corpus_tw, tolower): transformation drops
## documents
corpus_tw = tm_map(corpus_tw, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus_tw, removePunctuation): transformation
## drops documents
corpus_tw = tm_map(corpus_tw, removeWords, c(stopwords("english")))
## Warning in tm_map.SimpleCorpus(corpus_tw, removeWords, c(stopwords("english"))):
## transformation drops documents
corpus_tw = tm_map(corpus_tw, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus_tw, stemDocument): transformation drops
## documents
frequencies_tw = DocumentTermMatrix(corpus_tw)

sparse_tw = removeSparseTerms(frequencies_tw, 0.995)

sparse_tw.df = as.data.frame(as.matrix(sparse_tw))

colnames(sparse_tw.df) = make.names(colnames(sparse_tw.df))

Classify the tweets into 5 categories based on the scores provided by the get_sentiment function.

#category_sentiment <- ifelse(sent.value < 0, "Bad",
#                      ifelse(sent.value >= 0 & sent.value < 0.5, "Average",
#                      ifelse(sent.value >= 0.5 & sent.value < 1, "Good",
#                      ifelse(sent.value >= 1 & sent.value < 1.5, "Very Good", "Excellent"))))

#sparse_tw.df$Polarity = category_sentiment

#table(sparse_tw.df$Polarity)

#category_sentiment <- ifelse(sent.value < 0, "Bad",
#                      ifelse(sent.value == 0, "Ignore",
#                      ifelse(sent.value > 0 & sent.value < 0.5, "Average",
#                      ifelse(sent.value >= 0.5 & sent.value < 1, "Good",
#                      ifelse(sent.value >= 1 & sent.value < 1.5, "Very Good", "Excellent")))))

#sparse_tw.df$Polarity = category_sentiment

#table(sparse_tw.df$Polarity)
category_sentiment <- ifelse(sent.value < 0, 1, ifelse(sent.value == 0 , "Ignore",
ifelse(sent.value > 0 & sent.value < 0.5, 2,
ifelse(sent.value >=0.5 & sent.value < 1 ,3,
ifelse(sent.value >= 1 & sent.value < 1.5 ,4,5)))))

sparse_tw.df$Polarity = category_sentiment

table(sparse_tw.df$Polarity)
##
## 1 2 3 4 5 Ignore
## 8345 6124 14411 12434 15069 11110
sparse_tw_new.df <- filter(sparse_tw.df, Polarity != "Ignore")

table(sparse_tw_new.df$Polarity)
##
## 1 2 3 4 5
## 8345 6124 14411 12434 15069
ggplot(data=sparse_tw_new.df, aes(Polarity, fill=Polarity))+ geom_bar()
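As an optional sketch (not part of the original pipeline), an overall 1-5 rating for the movie can be derived from this polarity distribution by averaging the numeric categories; polarity_numeric and overall_rating are names introduced here only for illustration.

# Average the 1-5 polarity categories to get a single pre-release rating
polarity_numeric <- as.numeric(as.character(sparse_tw_new.df$Polarity))
overall_rating <- round(mean(polarity_numeric), 2)
overall_rating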

Plot the percentage share of each polarity category

ggplot(sparse_tw.df, aes(x= Polarity)) +


geom_bar(aes(y = ..prop.., fill = Polarity ) , stat="count") +
geom_text(aes( label = scales::percent(..prop..),
y= ..prop.. ), stat= "count", vjust = -.5)+
labs(y = "Percent") +
scale_y_continuous(labels = scales::percent)

BUILD CLASSIFICATION MODELS AND PREDICT FOR PRE RELEASE ANALYSIS

We will use different classification models and compare their accuracy and performance. Polarity
will be the dependent variable.
prop.table(table(sparse_tw_new.df$Polarity))
##
## 1 2 3 4 5
## 0.1480056 0.1086143 0.2555912 0.2205275 0.2672614

Extract a random sample of 5,000 observations from the DTM

#model_data <- sparse_tw_new.df %>% sample_frac(0.10)


model_data <- sample_n(sparse_tw_new.df, 5000)
dim(model_data)
## [1] 5000 423

Split the data into Train and Test

library(caTools)
##
## Attaching package: 'caTools'
## The following object is masked from 'package:RWeka':
##
## LogitBoost
set.seed(777)

model_data$Polarity <- as.factor(model_data$Polarity)

spl = sample.split(model_data$Polarity, SplitRatio = 0.7)

train_data = subset(model_data, spl == TRUE)


test_data = subset(model_data, spl == FALSE)

prop.table(table(train_data$Polarity))
##
## 1 2 3 4 5
## 0.1482857 0.1080000 0.2491429 0.2268571 0.2677143
prop.table(table(test_data$Polarity))
##
## 1 2 3 4 5
## 0.1480000 0.1080000 0.2493333 0.2273333 0.2673333

Build CART Model

# Load the Libraries


library(rpart)
library(rpart.plot)

movie_cart_model = rpart(Polarity ~ ., data=train_data, method="class")

#CART Diagram
prp(movie_cart_model, extra=2)

Predict and Evaluate the Performance of CART on train data

predict_cart_train_pre = predict(movie_cart_model, data=train_data, type="class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_cart <- table(train_data$Polarity, predict_cart_train_pre)
confusion_matrix_cart
## predict_cart_train_pre
## 1 2 3 4 5
## 1 293 0 217 1 8
## 2 0 81 292 1 4
## 3 2 0 847 3 20
## 4 6 0 182 583 23
## 5 0 0 166 52 719
# Overall accuracy
accuracy_cart_train_pre = sum(diag(confusion_matrix_cart))/sum(confusion_matrix_cart)
accuracy_cart_train_pre
## [1] 0.7208571

Predict and Evaluate the Performance of CART on test data

predict_cart_test_pre = predict(movie_cart_model, newdata=test_data, type="class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_cart <- table(test_data$Polarity, predict_cart_test_pre)
confusion_matrix_cart
## predict_cart_test_pre
## 1 2 3 4 5
## 1 133 0 88 0 1
## 2 0 35 122 0 5
## 3 0 0 363 1 10

## 4 6 0 77 250 8
## 5 0 0 73 31 297
# Overall accuracy
accuracy_cart_test_pre = sum(diag(confusion_matrix_cart))/sum(confusion_matrix_cart)
accuracy_cart_test_pre
## [1] 0.7186667

AUC-ROC Curve for CART on Train and Test dataset

#Train data - Plot ROC curve


roc_obj_cart_train_pre <- multiclass.roc(as.numeric(train_data$Polarity),
                                         as.numeric(predict_cart_train_pre), quiet=TRUE)
roc_obj_cart_train_pre
##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_cart_train_pre), quiet = TRUE)
##
## Data: as.numeric(predict_cart_train_pre) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.8387
#Test data - Plot ROC curve
roc_obj_cart_test_pre <- multiclass.roc(as.numeric(test_data$Polarity),
                                        as.numeric(predict_cart_test_pre), quiet=TRUE)
roc_obj_cart_test_pre
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_cart_test_pre), quiet = TRUE)
##
## Data: as.numeric(predict_cart_test_pre) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.8389

Comparison of all the performance measures of the CART Model on the Train and Test datasets

results_cart_train_pre = data.frame(accuracy_cart_train_pre,
as.numeric(roc_obj_cart_train_pre$auc))
names(results_cart_train_pre) = c("ACCURACY", "AUC-ROC" )

results_cart_test_pre =
data.frame(accuracy_cart_test_pre,as.numeric(roc_obj_cart_test_pre$auc) )
names(results_cart_test_pre) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_cart_train_pre, results_cart_test_pre)


row.names(df_fin) = c('CART Train Pre', 'CART Test Pre')
df_fin
## ACCURACY AUC-ROC
## CART Train Pre 0.7208571 0.8386559
## CART Test Pre 0.7186667 0.8389384

Build Random Forest Model

# Load Library
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
set.seed(777)

movie_rf_model = randomForest(Polarity ~ ., data=train_data,importance=TRUE)


movie_rf_model
##
## Call:
## randomForest(formula = Polarity ~ ., data = train_data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 20
##
## OOB estimate of error rate: 6.46%
## Confusion matrix:
## 1 2 3 4 5 class.error
## 1 458 14 44 2 1 0.11753372
## 2 8 335 29 1 5 0.11375661
## 3 12 5 849 5 1 0.02637615
## 4 7 6 32 741 8 0.06675063
## 5 4 9 29 4 891 0.04909285

Predict and Evaluate the Performance of Random Forest on train data

# Make predictions:
predict_rf_train_pre = predict(movie_rf_model, data=train_data,type="response")

# Evaluate the performance - Confusion matrix :


confusion_matrix_rf <- table(train_data$Polarity, predict_rf_train_pre)
confusion_matrix_rf
## predict_rf_train_pre
## 1 2 3 4 5
## 1 458 14 44 2 1
## 2 8 335 29 1 5
## 3 12 5 849 5 1
## 4 7 6 32 741 8
## 5 4 9 29 4 891
# Overall accuracy:
accuracy_rf_train_pre = sum(diag(confusion_matrix_rf))/sum(confusion_matrix_rf)
accuracy_rf_train_pre
## [1] 0.9354286

Predict and Evaluate the Performance of Random Forest on test data

# Make predictions:
predict_rf_test_pre = predict(movie_rf_model, newdata=test_data,type="response")

# Evaluate the performance - Confusion matrix :


confusion_matrix_rf <- table(test_data$Polarity, predict_rf_test_pre)
confusion_matrix_rf
## predict_rf_test_pre
## 1 2 3 4 5
## 1 203 5 14 0 0
## 2 6 139 12 1 4
## 3 6 2 365 0 1
## 4 4 1 17 318 1
## 5 2 0 21 4 374
# Overall accuracy:
accuracy_rf_test_pre = sum(diag(confusion_matrix_rf))/sum(confusion_matrix_rf)
accuracy_rf_test_pre
## [1] 0.9326667

Variable Importance of Random Forest

#Variable importance:
#varImpPlot(movie_rf_model,main='Variable Importance Plot: Movie ',type=2)
varImpPlot(movie_rf_model,sort=TRUE,type=NULL, class=NULL, scale=TRUE,cex=.8)
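For a tabular view to complement the plot above, the most important terms can also be listed directly. This is an optional sketch using the importance() accessor from randomForest; importance_df is a name introduced here only for illustration.

# List the ten terms with the highest MeanDecreaseGini
importance_df <- as.data.frame(importance(movie_rf_model))
head(importance_df[order(-importance_df$MeanDecreaseGini), ], 10)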

AUC-ROC Curve for Random Forest on Train and Test dataset

#Train data - Plot ROC curve


roc_obj_rf_train_pre <- multiclass.roc(as.numeric(train_data$Polarity),
                                       as.numeric(predict_rf_train_pre), quiet=TRUE)
roc_obj_rf_train_pre

##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_rf_train_pre), quiet = TRUE)
##
## Data: as.numeric(predict_rf_train_pre) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.9585
#Test data - Plot ROC curve
roc_obj_rf_test_pre <- multiclass.roc(as.numeric(test_data$Polarity),
                                      as.numeric(predict_cart_test_pre), quiet=TRUE)
roc_obj_rf_test_pre
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_cart_test_pre), quiet = TRUE)
##
## Data: as.numeric(predict_cart_test_pre) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.8389

Comparison of all the performance measures of the Random Forest Model on the Train and Test datasets

results_rf_train_pre = data.frame(accuracy_rf_train_pre,
as.numeric(roc_obj_rf_train_pre$auc))
names(results_rf_train_pre) = c("ACCURACY", "AUC-ROC" )

results_rf_test_pre = data.frame(accuracy_rf_test_pre,as.numeric(roc_obj_rf_test_pre$auc)
)
names(results_rf_test_pre) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_rf_train_pre, results_rf_test_pre)


row.names(df_fin) = c('Random Forest Train Pre', 'Random Forest Test Pre')
df_fin
## ACCURACY AUC-ROC
## Random Forest Train Pre 0.9354286 0.9584646
## Random Forest Test Pre 0.9326667 0.8389384

Build SVM Model

set.seed(123)
library(e1071)

movie_svm_model = svm(Polarity ~ . , data = train_data)


## Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
## Variable(s) 'survey' and 'lagaan' and 'tadka' and 'britishcensor' and 'azaadi'
## and 'raheng' and 'yaaar' constant. Cannot scale data.
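The warning above is only about scaling: a few term columns are constant in the training sample. As an optional sketch (an assumption, not part of the original run), such columns could be dropped before fitting; predictor_cols, is_constant, train_data_svm and movie_svm_model_noconst are illustrative names, and the model fitted above is the one used in the predictions that follow.

# Drop constant predictors so svm() can scale the remaining columns without warnings
predictor_cols <- setdiff(names(train_data), "Polarity")
is_constant <- vapply(train_data[predictor_cols],
                      function(x) length(unique(x)) <= 1, logical(1))
train_data_svm <- train_data[, c(predictor_cols[!is_constant], "Polarity")]
movie_svm_model_noconst <- svm(Polarity ~ ., data = train_data_svm)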

Predict and Evaluate the performance of SVM Model on train data

# Make predictions:
predict_svm_train_pre = predict(movie_svm_model, data=train_data, decision.values=TRUE)

# Evaluate the performance - Confusion matrix :


confusion_matrix_svm <- table(train_data$Polarity, predict_svm_train_pre)
confusion_matrix_svm

## predict_svm_train_pre
## 1 2 3 4 5
## 1 272 10 181 7 49
## 2 1 116 203 0 58
## 3 0 0 857 1 14
## 4 3 0 220 524 47
## 5 0 1 105 1 830
# Overall accuracy:
accuracy_svm_train_pre = sum(diag(confusion_matrix_svm))/sum(confusion_matrix_svm)
accuracy_svm_train_pre
## [1] 0.7425714

Predict and Evaluate the performance of SVM Model on test data

# Make predictions:
predict_svm_test_pre = predict(movie_svm_model, newdata=test_data,decision.values=TRUE)

# Evaluate the performance - Confusion matrix :


confusion_matrix_svm <- table(test_data$Polarity, predict_svm_test_pre)
confusion_matrix_svm
## predict_svm_test_pre
## 1 2 3 4 5
## 1 123 4 74 2 19
## 2 0 50 92 0 20
## 3 0 0 363 0 11
## 4 1 0 98 218 24
## 5 0 0 50 4 347
# Overall accuracy:
accuracy_svm_test_pre = sum(diag(confusion_matrix_svm))/sum(confusion_matrix_svm)
accuracy_svm_test_pre
## [1] 0.734

AUC-ROC Curve for SVM model on Train and Test dataset

library(pROC)

#Train data - Plot ROC curve


roc_obj_svm_train_pre <- multiclass.roc(as.numeric(train_data$Polarity),
                                        as.numeric(predict_svm_train_pre), quiet=TRUE)
roc_obj_svm_train_pre
##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_svm_train_pre), quiet = TRUE)
##
## Data: as.numeric(predict_svm_train_pre) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.8089
#Test data - Plot ROC curve
roc_obj_svm_test_pre <- multiclass.roc(as.numeric(test_data$Polarity),
                                       as.numeric(predict_svm_test_pre), quiet=TRUE)
roc_obj_svm_test_pre
##
## Call:

## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_svm_test_pre), quiet = TRUE)
##
## Data: as.numeric(predict_svm_test_pre) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.8154

Comparison of all the performance measures of the SVM Model on the Train and Test datasets

results_svm_train_pre = data.frame(accuracy_svm_train_pre,
as.numeric(roc_obj_svm_train_pre$auc))
names(results_svm_train_pre) = c("ACCURACY", "AUC-ROC" )

results_svm_test_pre =
data.frame(accuracy_svm_test_pre,as.numeric(roc_obj_svm_test_pre$auc) )
names(results_svm_test_pre) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_svm_train_pre, results_svm_test_pre)


row.names(df_fin) = c('SVM Train Pre', 'SVM Test Pre')
df_fin
## ACCURACY AUC-ROC
## SVM Train Pre 0.7425714 0.8089172
## SVM Test Pre 0.7340000 0.8153728

Build Naive Bayes Model

set.seed(777)

movie_nb_model = naiveBayes(Polarity ~ . , usekernel=T, data = train_data)

Predict and Evaluate the performance of NB Model on train data

# Make predictions:
predict_nb_train_pre = predict(movie_nb_model, train_data, type = "class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_nb <- table(train_data$Polarity, predict_nb_train_pre)
confusion_matrix_nb
## predict_nb_train_pre
## 1 2 3 4 5
## 1 370 72 12 65 0
## 2 28 290 6 54 0
## 3 77 136 199 460 0
## 4 52 77 8 657 0
## 5 181 245 89 247 175
# Overall accuracy:
accuracy_nb_train_pre = sum(diag(confusion_matrix_nb))/sum(confusion_matrix_nb)
accuracy_nb_train_pre
## [1] 0.4831429

Predict and Evaluate the performance of NB Model on test data

# Make predictions:
predict_nb_test_pre = predict(movie_nb_model,newdata = test_data, type = "class")

# Evaluate the performance - Confusion matrix :

confusion_matrix_nb <- table(test_data$Polarity, predict_nb_test_pre)
confusion_matrix_nb
## predict_nb_test_pre
## 1 2 3 4 5
## 1 155 36 5 26 0
## 2 10 126 3 23 0
## 3 43 77 92 162 0
## 4 20 39 7 275 0
## 5 79 109 29 109 75
# Overall accuracy:
accuracy_nb_test_pre = sum(diag(confusion_matrix_nb))/sum(confusion_matrix_nb)
accuracy_nb_test_pre
## [1] 0.482

AUC-ROC Curve for Naive Bayes model on Train and Test dataset

library(pROC)

#Train data - Plot ROC curve


roc_obj_nb_train_pre <- multiclass.roc(as.numeric(train_data$Polarity),
                                       as.numeric(predict_nb_train_pre), quiet=TRUE)
roc_obj_nb_train_pre
##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_nb_train_pre), quiet = TRUE)
##
## Data: as.numeric(predict_nb_train_pre) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.7285
#Test data - Plot ROC curve
roc_obj_nb_test_pre <-
multiclass.roc(as.numeric(test_data$Polarity),as.numeric(predict_nb_test_pre),quiet=TRUE)
roc_obj_nb_test_pre
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_nb_test_pre), quiet = TRUE)
##
## Data: as.numeric(predict_nb_test_pre) with 5 levels of as.numeric(test_data$Polarity):
1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.7203

Comparison of all the performance measures of the Naive Bayes Model on the Train and Test datasets

results_nb_train_pre = data.frame(accuracy_nb_train_pre,
as.numeric(roc_obj_nb_train_pre$auc))
names(results_nb_train_pre) = c("ACCURACY", "AUC-ROC" )

results_nb_test_pre = data.frame(accuracy_nb_test_pre,as.numeric(roc_obj_nb_test_pre$auc)
)
names(results_nb_test_pre) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_nb_train_pre, results_nb_test_pre)

row.names(df_fin) = c('Naive Bayes Train Pre', 'Naive Bayes Test Pre')
df_fin
## ACCURACY AUC-ROC
## Naive Bayes Train Pre 0.4831429 0.7284567
## Naive Bayes Test Pre 0.4820000 0.7202593

Comparing all the models - CART, Random Forest, SVM and Naive Bayes - on their performance
measures: Accuracy and AUC-ROC

df_fin = rbind(results_cart_train_pre, results_cart_test_pre,
               results_rf_train_pre, results_rf_test_pre,
               results_svm_train_pre, results_svm_test_pre,
               results_nb_train_pre, results_nb_test_pre)

row.names(df_fin) = c('CART Train Pre', 'CART Test Pre',
                      'Random Forest Train Pre', 'Random Forest Test Pre',
                      'SVM Train Pre', 'SVM Test Pre',
                      'Naive Bayes Train Pre', 'Naive Bayes Test Pre')

#round(df_fin,2)

#install.packages("kableExtra")
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
print("Model Performance Comparison Metrics ")
## [1] "Model Performance Comparison Metrics "
kable(round(df_fin,2)) %>%
kable_styling(c("striped","bordered"))

                           ACCURACY   AUC-ROC
CART Train Pre                 0.72      0.84
CART Test Pre                  0.72      0.84
Random Forest Train Pre        0.94      0.96
Random Forest Test Pre         0.93      0.84
SVM Train Pre                  0.74      0.81
SVM Test Pre                   0.73      0.82
Naive Bayes Train Pre          0.48      0.73
Naive Bayes Test Pre           0.48      0.72
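
Accuracy and AUC-ROC are the only measures computed in this comparison; if per-class sensitivity and specificity are also wanted, they can be read off an existing confusion matrix. A minimal sketch, assuming the caret package is installed (caret is not otherwise used in this report), illustrated with the Naive Bayes test predictions above:

library(caret)

# One row per polarity class, with Sensitivity and Specificity among the columns of byClass.
cm_nb <- confusionMatrix(data      = predict_nb_test_pre,
                         reference = factor(test_data$Polarity, levels = levels(predict_nb_test_pre)))
round(cm_nb$byClass[, c("Sensitivity", "Specificity")], 2)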

POST MOVIE RELEASE ANALYSIS

Read the tweets after the movie release

post_tweets_data <- movie_tweets_data %>%


filter(created_at >= release_date & created_at < post_first_week_date)
head(post_tweets_data)
## X.1 X user_id status_id created_at screen_name
## 1 158119 91679 1.039020e+18 1.215423e+18 2020-01-10 00:00:23 AJAY_sardar_
## 2 170582 104142 1.214066e+18 1.215423e+18 2020-01-10 00:00:53 Vastha421
## 3 164898 98458 9.079010e+17 1.215423e+18 2020-01-10 00:01:19 RAMUKUM01606330
## 4 172223 105783 8.842913e+07 1.215423e+18 2020-01-10 00:01:42 PARVATHINP
## 5 158922 92482 8.525627e+17 1.215423e+18 2020-01-10 00:01:55 Ankit_patel_AP
## 6 173164 106724 9.325149e+17 1.215423e+18 2020-01-10 00:01:55 NanoMIndia
##
text
## 1
Movie #Tanhaji is not holiday release so comparing its advance with holiday releases only
show ur jealous soul.
## 2
I can see a certain honesty in the early reviews of #Tanhaji which are coming out!! This
is so bloody rare in these times! And the kind of things people are saying,i am now so so
excited to catch it at the earliest! #TanhajiReview
## 3 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 4

#TanhajiTheUnsungWarrior Media Screening report is EPIC , my friend who is attending the
screening in mumbai saying its best film of @ajaydevgn career. #Tanhaji
## 5
WHAT A FILM YAAAR!<U+0001F525><U+0001F525>\nThe climax had the heart-beat racing like
never before. This is how Historic WAR films should be made - Bollywood's Best so far!
@OmRaut &amp; @AjayDevgn TAKE A BOW<U+0001F64F>\n \n#Tanhaji #TanhajiReview
## 6 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## source display_text_width reply_to_status_id reply_to_user_id
## 1 Twitter for Android 128 NA NA
## 2 Twitter for Android 140 NA NA
## 3 Twitter for Android 140 NA NA
## 4 Twitter for Android 140 NA NA
## 5 Twitter for Android 140 NA NA
## 6 Twitter for Android 140 NA NA
## reply_to_screen_name is_quote is_retweet favorite_count retweet_count
## 1 <NA> FALSE TRUE 0 17
## 2 <NA> FALSE TRUE 0 24
## 3 <NA> FALSE TRUE 0 10287
## 4 <NA> FALSE TRUE 0 948
## 5 <NA> FALSE TRUE 0 631
## 6 <NA> FALSE TRUE 0 10287
## quote_count reply_count hashtags symbols
## 1 NA NA Tanhaji NA
## 2 NA NA Tanhaji NA
## 3 NA NA c("OneWordReview", "Tanhaji", "Tanhaji") NA
## 4 NA NA TanhajiTheUnsungWarrior NA
## 5 NA NA <NA> NA
## 6 NA NA c("OneWordReview", "Tanhaji", "Tanhaji") NA
## urls_url urls_t.co urls_expanded_url media_url media_t.co media_expanded_url
## 1 <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA>
## media_type ext_media_url ext_media_t.co ext_media_expanded_url ext_media_type
## 1 <NA> <NA> <NA> <NA> NA
## 2 <NA> <NA> <NA> <NA> NA
## 3 <NA> <NA> <NA> <NA> NA
## 4 <NA> <NA> <NA> <NA> NA
## 5 <NA> <NA> <NA> <NA> NA
## 6 <NA> <NA> <NA> <NA> NA
## mentions_user_id mentions_screen_name lang quoted_status_id quoted_text
## 1 3086592157 OpinionsRP en NA <NA>
## 2 1337106955 ajay36mittal en NA <NA>
## 3 99642673 taran_adarsh en NA <NA>
## 4 146937987 SumitkadeI en NA <NA>
## 5 142231741 Nilzrav en NA <NA>
## 6 99642673 taran_adarsh en NA <NA>
## quoted_created_at quoted_source quoted_favorite_count quoted_retweet_count
## 1 <NA> <NA> NA NA
## 2 <NA> <NA> NA NA
## 3 <NA> <NA> NA NA
## 4 <NA> <NA> NA NA
## 5 <NA> <NA> NA NA
## 6 <NA> <NA> NA NA

## quoted_user_id quoted_screen_name quoted_name quoted_followers_count
## 1 NA <NA> <NA> NA
## 2 NA <NA> <NA> NA
## 3 NA <NA> <NA> NA
## 4 NA <NA> <NA> NA
## 5 NA <NA> <NA> NA
## 6 NA <NA> <NA> NA
## quoted_friends_count quoted_statuses_count quoted_location quoted_description
## 1 NA NA <NA> <NA>
## 2 NA NA <NA> <NA>
## 3 NA NA <NA> <NA>
## 4 NA NA <NA> <NA>
## 5 NA NA <NA> <NA>
## 6 NA NA <NA> <NA>
## quoted_verified retweet_status_id
## 1 NA 1.215320e+18
## 2 NA 1.215349e+18
## 3 NA 1.215295e+18
## 4 NA 1.215203e+18
## 5 NA 1.215344e+18
## 6 NA 1.215295e+18
##
retweet_text
## 1
Movie #Tanhaji is not holiday release so comparing its advance with holiday releases only
show ur jealous soul.
## 2
I can see a certain honesty in the early reviews of #Tanhaji which are coming out!! This
is so bloody rare in these times! And the kind of things people are saying,i am now so so
excited to catch it at the earliest! #TanhajiReview
## 3 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## 4
#TanhajiTheUnsungWarrior Media Screening report is EPIC , my friend who is attending the
screening in mumbai saying its best film of @ajaydevgn career. #Tanhaji
## 5
WHAT A FILM YAAAR!<U+0001F525><U+0001F525>\nThe climax had the heart-beat racing like
never before. This is how Historic WAR films should be made - Bollywood's Best so far!
@OmRaut &amp; @AjayDevgn TAKE A BOW<U+0001F64F>\n \n#Tanhaji #TanhajiReview
## 6 #OneWordReview...\n#Tanhaji: SUPERB.\nRating:
<U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F><U+2B50><U+FE0F>\nDrama, emotions,
conflict, action, VFX, #Tanhaji is an enthralling experience... Electrifying climax...
Top notch direction... #Ajay, #Kajol, #Saif in super form... Get ready for 2020’s first
<U+20B9> <U+0001F4AF>cr+ film. #TanhajiReview https://t.co/N9TwWsWazd
## retweet_created_at retweet_source retweet_favorite_count
## 1 2020-01-09 17:08:44 Twitter for Android 39
## 2 2020-01-09 19:05:16 Twitter for Android 44
## 3 2020-01-09 15:31:45 Twitter for iPad 49234
## 4 2020-01-09 09:24:04 Twitter for iPhone 3490
## 5 2020-01-09 18:47:04 Twitter for Android 2152
## 6 2020-01-09 15:31:45 Twitter for iPad 49234
## retweet_retweet_count retweet_user_id retweet_screen_name retweet_name
## 1 17 3086592157 OpinionsRP Ash
## 2 24 1337106955 ajay36mittal JabTakHaiCinema
## 3 10287 99642673 taran_adarsh taran adarsh
## 4 948 146937987 SumitkadeI Sumit kadel
## 5 631 142231741 Nilzrav N J

## 6 10287 99642673 taran_adarsh taran adarsh
## retweet_followers_count retweet_friends_count retweet_statuses_count
## 1 1526 193 16988
## 2 187 440 12324
## 3 3741490 168 34910
## 4 87103 88 20466
## 5 5055 959 56183
## 6 3741490 168 34910
## retweet_location
## 1 Seven Heaven
## 2
## 3 Mumbai, India
## 4 Kolkata, West Bengal.
## 5 India/ UAE
## 6 Mumbai, India
##
retweet_description
## 1
......
## 2 Engineer,MBA in Finance but defined by my love for
Movies,Acting,Singing,Dancing,Music,Cricket!\nExtremely Occasional Blog 'FILMALAYA' at
following link:
## 3 Movie
critic | Biz analyst | Influencer | Instagram: https://t.co/4hvdbhvfkX
## 4 Film Trade analyst | Critic | Influencer | Youtube channel -
https://t.co/CaHFAF2LD5 . For work related query email me at - Sumitkadel21@yahoo.com
## 5 Office Manager | Janta's Movie Reviewer | #Chhapaak Review: TODAY | All Things
Humor, Films & 90s Bollywood | Fun RTs | RATIONALIST | Gujju |Food & Freedom
## 6 Movie
critic | Biz analyst | Influencer | Instagram: https://t.co/4hvdbhvfkX
## retweet_verified place_url place_name place_full_name place_type country
## 1 FALSE <NA> <NA> <NA> <NA> <NA>
## 2 FALSE <NA> <NA> <NA> <NA> <NA>
## 3 TRUE <NA> <NA> <NA> <NA> <NA>
## 4 TRUE <NA> <NA> <NA> <NA> <NA>
## 5 FALSE <NA> <NA> <NA> <NA> <NA>
## 6 TRUE <NA> <NA> <NA> <NA> <NA>
## country_code geo_coords coords_coords bbox_coords
## 1 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 2 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 3 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 4 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 5 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 6 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## status_url
## 1 https://twitter.com/AJAY_sardar_/status/1215423099513888770
## 2 https://twitter.com/Vastha421/status/1215423225473064960
## 3 https://twitter.com/RAMUKUM01606330/status/1215423336399826944
## 4 https://twitter.com/PARVATHINP/status/1215423430809411585
## 5 https://twitter.com/Ankit_patel_AP/status/1215423488988598273
## 6 https://twitter.com/NanoMIndia/status/1215423485943541760
##
name
## 1
AJAY
## 2
Vastha42
## 3
RAMU KUMAR
## 4

PARVATHI P
## 5
Ankit<U+2694>
## 6 <U+092E><U+093F><U+091F><U+094D><U+091F><U+0940> <U+0915><U+093E>
<U+092E><U+093E><U+0927><U+094B>
## location
## 1
## 2
## 3 Kuwait
## 4
## 5 follows you
## 6 Navi Mumbai, India
##
description
## 1 bhakt of AJAY
DEVGN, MSDhoni nd MODI JI!!\n<U+0001F1EE><U+0001F1F3><U+0001F1EE><U+0001F1F3>
## 2

## 3 FROM..VIL..MUSAHARI..
POST...KAIL GHAR..PS..BARHARIYA..DST...SIWAN...BIHAR.. LIVE..IN.. KUWAIT.. CITY..
## 4
India is my country.. Bharath Mata ki Jai..
## 5
MovieLover ... \n\n\n\n@ajaydevgn\n\n\n\n... SportLover \n\n@msdhoni
## 6 Staunch follower of Sanatana Dharma,I support Hindutva. \nTotally believe in One
Nation-One Rule. let's unite against terror and everything which is NOT indian.
## url protected followers_count friends_count listed_count statuses_count
## 1 <NA> FALSE 56 349 0 2232
## 2 <NA> FALSE 16 143 0 1781
## 3 <NA> FALSE 18 101 0 273
## 4 <NA> FALSE 846 1329 5 33709
## 5 <NA> FALSE 216 306 0 14024
## 6 <NA> FALSE 2098 4198 1 42894
## favourites_count account_created_at verified profile_url
## 1 7087 2018-09-10 05:17:12 FALSE <NA>
## 2 2021 2020-01-06 06:05:54 FALSE <NA>
## 3 7099 2017-09-13 09:37:22 FALSE <NA>
## 4 27070 2009-11-08 14:28:26 FALSE <NA>
## 5 22156 2017-04-13 16:42:53 FALSE <NA>
## 6 44266 2017-11-20 07:44:18 FALSE <NA>
## profile_expanded_url account_lang
## 1 <NA> NA
## 2 <NA> NA
## 3 <NA> NA
## 4 <NA> NA
## 5 <NA> NA
## 6 <NA> NA
## profile_banner_url
## 1 https://pbs.twimg.com/profile_banners/1039019940592803841/1578548036
## 2 <NA>
## 3 https://pbs.twimg.com/profile_banners/907901003567063040/1552758484
## 4 https://pbs.twimg.com/profile_banners/88429125/1465792930
## 5 https://pbs.twimg.com/profile_banners/852562745870491648/1571653398
## 6 https://pbs.twimg.com/profile_banners/932514924160364544/1578215611
## profile_background_url
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 http://abs.twimg.com/images/themes/theme1/bg.png
## 5 <NA>

## 6 <NA>
## profile_image_url
## 1 http://pbs.twimg.com/profile_images/1163969231710371840/_dPMB5kq_normal.jpg
## 2 http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png
## 3 http://pbs.twimg.com/profile_images/1106975563229605888/GsW5lHOn_normal.jpg
## 4 http://pbs.twimg.com/profile_images/835309518246420480/lLeww3af_normal.jpg
## 5 http://pbs.twimg.com/profile_images/1215169883228401664/9nfFxm-W_normal.jpg
## 6 http://pbs.twimg.com/profile_images/1201882496511688705/-XcIJvSA_normal.jpg
tail(post_tweets_data)
## X.1 X user_id status_id created_at screen_name
## 105848 66582 142 1.214896e+18 1.216941e+18 2020-01-14 04:32:37 Chintan64138110
## 105849 66576 136 1.158960e+18 1.216941e+18 2020-01-14 04:32:41 Gajendr16824337
## 105850 66448 8 1.387810e+08 1.216941e+18 2020-01-14 04:32:43 onenonlymahakal
## 105851 66443 3 1.163468e+18 1.216941e+18 2020-01-14 04:32:44 shri0944
## 105852 66441 1 1.669707e+09 1.216941e+18 2020-01-14 04:32:45 pruthwirajdeo
## 105853 66442 2 7.057414e+17 1.216941e+18 2020-01-14 04:32:45 Aryaman111
##
text
## 105848 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105849 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105850 As per the reliable
sources #TanhajiTheUnsungWarrior will be declared national film of India by our home
minister @AmitShah in next parliament session.\n\nCongratulations @ajaydevgn @omraut
<U+0001F389><U+0001F389>#Tanhaji
## 105851 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105852 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105853 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## source display_text_width reply_to_status_id
## 105848 Twitter for Android 140 NA
## 105849 Twitter for Android 140 NA
## 105850 Twitter for Android 140 NA
## 105851 Twitter Web App 140 NA
## 105852 Twitter for Android 140 NA
## 105853 Twitter for Android 140 NA
## reply_to_user_id reply_to_screen_name is_quote is_retweet favorite_count
## 105848 NA <NA> FALSE TRUE 0
## 105849 NA <NA> FALSE TRUE 0
## 105850 NA <NA> FALSE TRUE 0
## 105851 NA <NA> FALSE TRUE 0
## 105852 NA <NA> FALSE TRUE 0
## 105853 NA <NA> FALSE TRUE 0
## retweet_count quote_count reply_count hashtags symbols
## 105848 25 NA NA Tanhaji NA
## 105849 25 NA NA Tanhaji NA

## 105850 14 NA NA TanhajiTheUnsungWarrior NA
## 105851 25 NA NA Tanhaji NA
## 105852 25 NA NA Tanhaji NA
## 105853 25 NA NA Tanhaji NA
## urls_url urls_t.co urls_expanded_url media_url media_t.co
## 105848 <NA> <NA> <NA> <NA> <NA>
## 105849 <NA> <NA> <NA> <NA> <NA>
## 105850 <NA> <NA> <NA> <NA> <NA>
## 105851 <NA> <NA> <NA> <NA> <NA>
## 105852 <NA> <NA> <NA> <NA> <NA>
## 105853 <NA> <NA> <NA> <NA> <NA>
## media_expanded_url media_type ext_media_url ext_media_t.co
## 105848 <NA> <NA> <NA> <NA>
## 105849 <NA> <NA> <NA> <NA>
## 105850 <NA> <NA> <NA> <NA>
## 105851 <NA> <NA> <NA> <NA>
## 105852 <NA> <NA> <NA> <NA>
## 105853 <NA> <NA> <NA> <NA>
## ext_media_expanded_url ext_media_type mentions_user_id
## 105848 <NA> NA 2924521080
## 105849 <NA> NA 2924521080
## 105850 <NA> NA 2754072768
## 105851 <NA> NA 2924521080
## 105852 <NA> NA 2924521080
## 105853 <NA> NA 2924521080
## mentions_screen_name lang quoted_status_id quoted_text quoted_created_at
## 105848 davidfrawleyved en NA <NA> <NA>
## 105849 davidfrawleyved en NA <NA> <NA>
## 105850 RoninADfannn en NA <NA> <NA>
## 105851 davidfrawleyved en NA <NA> <NA>
## 105852 davidfrawleyved en NA <NA> <NA>
## 105853 davidfrawleyved en NA <NA> <NA>
## quoted_source quoted_favorite_count quoted_retweet_count quoted_user_id
## 105848 <NA> NA NA NA
## 105849 <NA> NA NA NA
## 105850 <NA> NA NA NA
## 105851 <NA> NA NA NA
## 105852 <NA> NA NA NA
## 105853 <NA> NA NA NA
## quoted_screen_name quoted_name quoted_followers_count
## 105848 <NA> <NA> NA
## 105849 <NA> <NA> NA
## 105850 <NA> <NA> NA
## 105851 <NA> <NA> NA
## 105852 <NA> <NA> NA
## 105853 <NA> <NA> NA
## quoted_friends_count quoted_statuses_count quoted_location
## 105848 NA NA <NA>
## 105849 NA NA <NA>
## 105850 NA NA <NA>
## 105851 NA NA <NA>
## 105852 NA NA <NA>
## 105853 NA NA <NA>
## quoted_description quoted_verified retweet_status_id
## 105848 <NA> NA 1.216941e+18
## 105849 <NA> NA 1.216941e+18
## 105850 <NA> NA 1.216796e+18
## 105851 <NA> NA 1.216941e+18
## 105852 <NA> NA 1.216941e+18
## 105853 <NA> NA 1.216941e+18

##
retweet_text
## 105848 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105849 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105850 As per the reliable
sources #TanhajiTheUnsungWarrior will be declared national film of India by our home
minister @AmitShah in next parliament session.\n\nCongratulations @ajaydevgn @omraut
<U+0001F389><U+0001F389>#Tanhaji
## 105851 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105852 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## 105853 Just saw #Tanhaji, a brilliant, well-acted and inspiring film. Shivaji and the
Marathas created the basis for India's Independence Movement and show the true warrior
spirit of India. Don't understand why Shiv Sena is not highlighting it. Would have made
Bal Thackeray proud.
## retweet_created_at retweet_source retweet_favorite_count
## 105848 2020-01-14 04:30:55 Twitter Web App 99
## 105849 2020-01-14 04:30:55 Twitter Web App 99
## 105850 2020-01-13 18:55:21 Twitter for Android 23
## 105851 2020-01-14 04:30:55 Twitter Web App 99
## 105852 2020-01-14 04:30:55 Twitter Web App 99
## 105853 2020-01-14 04:30:55 Twitter Web App 99
## retweet_retweet_count retweet_user_id retweet_screen_name
## 105848 25 2924521080 davidfrawleyved
## 105849 25 2924521080 davidfrawleyved
## 105850 14 2754072768 RoninADfannn
## 105851 25 2924521080 davidfrawleyved
## 105852 25 2924521080 davidfrawleyved
## 105853 25 2924521080 davidfrawleyved
## retweet_name
## 105848 Dr David Frawley
## 105849 Dr David Frawley
## 105850 THE WHITE WOLF..!! #Tanhaji <U+0001F6A9><U+0001F6A9>
## 105851 Dr David Frawley
## 105852 Dr David Frawley
## 105853 Dr David Frawley
## retweet_followers_count retweet_friends_count retweet_statuses_count
## 105848 307220 153 23992
## 105849 307220 153 23992
## 105850 1106 150 14186
## 105851 307220 153 23992
## 105852 307220 153 23992
## 105853 307220 153 23992
## retweet_location
## 105848 Santa Fe, NM USA
## 105849 Santa Fe, NM USA
## 105850
## 105851 Santa Fe, NM USA
## 105852 Santa Fe, NM USA

## 105853 Santa Fe, NM USA
##
retweet_description
## 105848 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 105849 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 105850

## 105851 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 105852 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## 105853 Pandit Vamadeva Shastri, Vedacharya, author of over forty books, Yoga, Ayurveda
and Vedanta, D.Litt., Padma Bhushan recipient, views personal
## retweet_verified place_url place_name place_full_name place_type country
## 105848 TRUE <NA> <NA> <NA> <NA> <NA>
## 105849 TRUE <NA> <NA> <NA> <NA> <NA>
## 105850 FALSE <NA> <NA> <NA> <NA> <NA>
## 105851 TRUE <NA> <NA> <NA> <NA> <NA>
## 105852 TRUE <NA> <NA> <NA> <NA> <NA>
## 105853 TRUE <NA> <NA> <NA> <NA> <NA>
## country_code geo_coords coords_coords bbox_coords
## 105848 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 105849 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 105850 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 105851 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 105852 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## 105853 <NA> c(NA, NA) c(NA, NA) c(NA, NA, NA, NA, NA, NA, NA, NA)
## status_url
## 105848 https://twitter.com/Chintan64138110/status/1216941161698349057
## 105849 https://twitter.com/Gajendr16824337/status/1216941181264777217
## 105850 https://twitter.com/onenonlymahakal/status/1216941189032640512
## 105851 https://twitter.com/shri0944/status/1216941193944190976
## 105852 https://twitter.com/pruthwirajdeo/status/1216941197249282050
## 105853 https://twitter.com/Aryaman111/status/1216941195357753345
## name location
## 105848 Chintan Kumar
## 105849 Gajendra Singh Shekhawat
## 105850 Nirmal Aj fan <U+0001F1EE><U+0001F1F3> Indore, India
## 105851 Shri (<U+0936><U+094D><U+0930><U+0940>) Bengaluru, India
## 105852 pruthwiraj
## 105853 Vivek
##
description
## 105848
Rashtravaadi | name changed for security reason |
## 105849
Friendly
## 105850
movie n cricket maniac,shiv bhakt n believer
## 105851 Proud Hindian (Hindu+Indian) \n<U+0C9C><U+0CC8>
<U+0CB6><U+0CCD><U+0CB0><U+0CC0> <U+0CB0><U+0CBE><U+0CAE><U+0CCD>
<U+0001F6A9>\n<U+0CAD><U+0CBE><U+0CB0><U+0CA4><U+0CCD> <U+0CAE><U+0CBE><U+0CA4><U+0CBE>
<U+0C95><U+0CBF> <U+0C9C><U+0CC8>
<U+0001F1EE><U+0001F1F3>\n<U+0CB5><U+0C82><U+0CA6><U+0CC7>
<U+0CAE><U+0CBE><U+0CA4><U+0CB0><U+0C82> <U+0001F1EE><U+0001F1F3>
## 105852

## 105853

## url protected followers_count friends_count listed_count statuses_count


## 105848 <NA> FALSE 6 77 0 610
## 105849 <NA> FALSE 13 258 0 63
## 105850 <NA> FALSE 647 2073 9 83860
## 105851 <NA> FALSE 49 334 0 1147
## 105852 <NA> FALSE 39 208 1 801
## 105853 <NA> FALSE 285 1326 0 3480
## favourites_count account_created_at verified profile_url
## 105848 659 2020-01-08 13:06:03 FALSE <NA>
## 105849 2785 2019-08-07 04:35:56 FALSE <NA>
## 105850 15036 2010-04-30 15:30:39 FALSE <NA>
## 105851 1824 2019-08-19 15:10:40 FALSE <NA>
## 105852 1378 2013-08-14 06:24:51 FALSE <NA>
## 105853 16825 2016-03-04 13:07:19 FALSE <NA>
## profile_expanded_url account_lang
## 105848 <NA> NA
## 105849 <NA> NA
## 105850 <NA> NA
## 105851 <NA> NA
## 105852 <NA> NA
## 105853 <NA> NA
## profile_banner_url
## 105848 <NA>
## 105849 <NA>
## 105850 https://pbs.twimg.com/profile_banners/138781014/1495135789
## 105851 https://pbs.twimg.com/profile_banners/1163468236496617474/1578248233
## 105852 <NA>
## 105853 <NA>
## profile_background_url
## 105848 <NA>
## 105849 <NA>
## 105850 http://abs.twimg.com/images/themes/theme1/bg.png
## 105851 <NA>
## 105852 http://abs.twimg.com/images/themes/theme1/bg.png
## 105853 <NA>
##
profile_image_url
## 105848
http://pbs.twimg.com/profile_images/1214897021569470464/PsmQvyxI_normal.jpg
## 105849
http://pbs.twimg.com/profile_images/1158960946838044673/8WZueqx5_normal.jpg
## 105850
http://pbs.twimg.com/profile_images/688317640897638403/ELaY-ZEX_normal.jpg
## 105851
http://pbs.twimg.com/profile_images/1163468598603444224/Ia-OmyqY_normal.jpg
## 105852
http://pbs.twimg.com/profile_images/378800000293680962/f80a13a608555e74bc6c43f883e9eb03_n
ormal.jpeg
## 105853
http://pbs.twimg.com/profile_images/1207229736100884480/s4fAGDXh_normal.jpg

Retrieve original tweets

# Remove retweets
movie_tweets_original <- post_tweets_data[post_tweets_data$is_retweet==FALSE, ]
# Remove replies
movie_tweets_original <- subset(movie_tweets_original,
is.na(movie_tweets_original$reply_to_status_id))

Find the most popular tweet by favorite_count (i.e. the number of likes) or retweet_count (i.e. the number of retweets)

# favorite_count
movie_tweets_original <- movie_tweets_original %>% arrange(-favorite_count)
movie_tweets_original[1,5]
## [1] "2020-01-11 10:05:20"
#retweet_count
movie_tweets_original <- movie_tweets_original %>% arrange(-retweet_count)
movie_tweets_original[1,5]
## [1] "2020-01-11 10:05:20"

SHOW THE RATIO OF REPLIES/RETWEETS/ORIGINAL TWEETS

# dataset containing only the retweets and one containing only the replies.

# Keeping only the retweets


movie_retweets <- post_tweets_data[post_tweets_data$is_retweet==TRUE,]

# Keeping only the replies


movie_replies <- subset(post_tweets_data, !is.na(post_tweets_data$reply_to_status_id))

Create a separate data frame containing the number of original tweets, retweets, and replies

# Creating a data frame

original_count <- nrow(movie_tweets_original)


retweets_count <- nrow(movie_retweets)
replies_count <- nrow(movie_replies)

movie_data <- data.frame(
  category=c("Original", "Retweets", "Replies"),
  count=c(original_count, retweets_count, replies_count)
)
# Adding columns
movie_data$fraction = movie_data$count / sum(movie_data$count)
movie_data$percentage = movie_data$count / sum(movie_data$count) * 100
movie_data$ymax = cumsum(movie_data$fraction)
movie_data$ymin = c(0, head(movie_data$ymax, n=-1))

# Rounding the movie_data to two decimal points


movie_data <- round_df(movie_data, 2)

# Specify what the legend should say


Type_of_Tweet <- paste(movie_data$category, movie_data$percentage, "%")
ggplot(movie_data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=Type_of_Tweet)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(2, 4)) +
theme_void() +
theme(legend.position = "right")

SHOW THE SOURCE DEVICES FROM WHICH THE TWEETS ARE PUBLISHED

# (note: this summarises pre_tweets_data even though this is the post-release section;
#  post_tweets_data may have been intended)
tw_app <- pre_tweets_data %>%
  select(source) %>%
  group_by(source) %>%
  summarize(count=n())
tw_app <- subset(tw_app, count > 1000)

device_data <- data.frame(
  category=tw_app$source,
  count=tw_app$count
)
device_data$fraction = device_data$count / sum(device_data$count)
device_data$percentage = device_data$count / sum(device_data$count) * 100
device_data$ymax = cumsum(device_data$fraction)
device_data$ymin = c(0, head(device_data$ymax, n=-1))
device_data <- round_df(device_data, 2)
Source <- paste(device_data$category, device_data$percentage, "%")
ggplot(device_data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=Source)) +
  geom_rect() +
  coord_polar(theta="y") + # Try removing this layer to see how the chart is built initially
  xlim(c(2, 4)) +
  theme_void() +
  theme(legend.position = "right")

SHOW THE MOST FREQUENT WORDS FOUND IN THE TWEETS

#Cleaning the data


movie_tweets_original$text <- gsub("https\\S*", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("@\\S*", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("amp", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("[\r\n]", "", movie_tweets_original$text)
movie_tweets_original$text <- gsub("[[:punct:]]", "", movie_tweets_original$text)

# remove stop words from the text

tweets <- movie_tweets_original %>%
  select(text) %>%
  unnest_tokens(word, text)
tweets <- tweets %>%
  anti_join(stop_words)
## Joining, by = "word"

Plot the most frequent words found in the tweets

# Bar chart of the most frequent words found in the tweets
tweets %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(y = "Count",
       x = "Unique words",
       title = "Most frequent words found in the tweets of the movie",
       subtitle = "Stop words removed from the list")

## Selecting by n

SHOW THE MOST FREQUENTLY USED HASHTAGS

movie_tweets_original$hashtags <- as.character(movie_tweets_original$hashtags)


movie_tweets_original$hashtags <- gsub("c\\(", "", movie_tweets_original$hashtags)
set.seed(1234)
wordcloud(movie_tweets_original$hashtags, min.freq=50, scale=c(2, 1), random.order=FALSE,
rot.per=0.35, colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

SHOW THE ACCOUNTS FROM WHICH MOST RETWEETS ORIGINATE

set.seed(1234)
wordcloud(post_tweets_data$retweet_screen_name, min.freq=200, scale=c(2, .5),
random.order=FALSE, rot.per=0.25,
colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

SHOW THE LOCATIONS FROM WHICH MOST OF THE TWEETS ORIGINATE

set.seed(1234)

wordcloud(post_tweets_data$location, min.freq=400, scale=c(3, 1), random.order=FALSE,
          rot.per=0.25, colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

#wordcloud(tweets_data$retweet_location, min.freq=500,colors=brewer.pal(8, "Dark2"))

SHOW THE LOCATIONS FROM WHICH MOST OF THE RETWEETS ORIGINATE

set.seed(1234)

wordcloud(post_tweets_data$retweet_location, min.freq=200, scale=c(3, 1), random.order=FALSE,
          rot.per=0.25, colors=brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

#wordcloud(tweets_data$retweet_location, min.freq=500,colors=brewer.pal(8, "Dark2"))

PERFORM A SENTIMENT ANALYSIS OF THE TWEETS ( “syuzhet” package )

# Converting tweets to ASCII to tackle strange characters


tweets <- iconv(tweets, from="UTF-8", to="ASCII", sub="")

# removing retweets, in case needed


tweets <-gsub("(RT|via)((?:\\b\\w*@\\w+)+)","",tweets)

# removing mentions, in case needed


tweets <-gsub("@\\w+","",tweets)

ew_sentiment<-get_nrc_sentiment((tweets))
sentimentscores<-data.frame(colSums(ew_sentiment[,]))
names(sentimentscores) <- "Score"
sentimentscores <- cbind("sentiment"=rownames(sentimentscores),sentimentscores)
rownames(sentimentscores) <- NULL
ggplot(data=sentimentscores,aes(x=sentiment,y=Score))+
geom_bar(aes(fill=sentiment),stat = "identity")+
theme(legend.position="none")+
xlab("Sentiments")+ylab("Scores")+
ggtitle("Total sentiment based on scores")+
theme_minimal()
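
For intuition about the scores that drive the rating buckets defined later, get_sentiment() returns one signed numeric score per string. A quick illustration with two made-up example strings (not taken from the dataset; exact magnitudes depend on the default syuzhet lexicon):

# Positive wording should give a positive score, negative wording a negative one.
get_sentiment(c("what a brilliant and inspiring film",
                "boring film and a waste of money"))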

CLEANING THE DATA

post_tweets_data$text = gsub("&amp", "", post_tweets_data$text)


post_tweets_data$text = gsub("&amp", "", post_tweets_data$text)
post_tweets_data$text = gsub("rt|RT", "", post_tweets_data$text) # remove Retweet
post_tweets_data$text = iconv(post_tweets_data$text, "latin1", "ASCII", sub="") # Remove
emojis/dodgy unicode
post_tweets_data$text = gsub("<(.*)>", "", post_tweets_data$text) # Remove pesky Unicodes
like <U+A>
post_tweets_data$text = gsub("https(.*)*$", "", post_tweets_data$text) # remove tweet URL
post_tweets_data$text = gsub("www[[:alnum:][:punct:]]*","",
tolower(post_tweets_data$text ))
post_tweets_data$text = gsub("<.*?>", "", post_tweets_data$text) # remove html tags
post_tweets_data$text = gsub("@\\w+", "", post_tweets_data$text) # remove at(@)
post_tweets_data$text = gsub("[[:punct:]]", "", post_tweets_data$text) # remove
punctuation
post_tweets_data$text = gsub("\r?\n|\r", " ", post_tweets_data$text) # remove /n
post_tweets_data$text = gsub("[[:digit:]]", " ", post_tweets_data$text) # remove
numbers/Digits
post_tweets_data$text = gsub("[ |\t]{2,}", " ", post_tweets_data$text) # remove tabs
post_tweets_data$text = gsub("^ ", "", post_tweets_data$text) # remove blank spaces at
the beginning
post_tweets_data$text = gsub(" $", "", post_tweets_data$text) # remove blank spaces at
the end

head(post_tweets_data$text)
## [1] "movie tanhaji is not holiday release so comparing its advance with holiday
releases only show ur jealous soul"

## [2] "i can see a ceain honesty in the early reviews of tanhaji which are coming out
this is so bloody rare in these times and the kind of things people are sayingi am now so
so excited to catch it at the earliest tanhajireview"

## [3] "onewordreview tanhaji superb rating cr film tanhajireview"

## [4] "tanhajitheunsungwarrior media screening repo is epic my friend who is attending


the screening in mumbai saying its best film of career tanhaji"

## [5] "what a film yaaar tanhaji tanhajireview"

## [6] "onewordreview tanhaji superb rating cr film tanhajireview"

Create a subset of the tweet texts

set.seed(777) # Make process reproducible

sub_blogs = post_tweets_data$text[sample(length(post_tweets_data$text), length(post_tweets_data$text)*0.1)] # take a 10% random subset

Creating a corpus and cleaning data

sub_blogs_Corpus <- VCorpus(VectorSource(sub_blogs))                               # Make corpus

sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, stripWhitespace)                      # Remove unnecessary white space
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removePunctuation)                    # Remove punctuation
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removeNumbers)                        # Remove numbers
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, tolower)                              # Convert to lowercase
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, PlainTextDocument)                    # Plain text
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removeWords, stopwords("english"))    # Remove English stop words

Tokenizing, calculating frequencies and making plots of n-grams

n_grams_plot <- function(n, data) {
  # Requires RWeka (NGramTokenizer, Weka_control) and slam (rollup), loaded earlier.
  options(mc.cores=1)

  # Build an n-gram tokenizer
  tk <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))

  # Create the term-document matrix of n-grams
  ngrams_matrix <- TermDocumentMatrix(data, control=list(tokenize=tk))
  # Collapse over documents for easy viewing
  ngrams_matrix <- as.matrix(rollup(ngrams_matrix, 2, na.rm=TRUE, FUN=sum))
  ngrams_matrix <- data.frame(word=rownames(ngrams_matrix), freq=ngrams_matrix[,1])
  # Keep the 20 most frequent n-grams
  ngrams_matrix <- ngrams_matrix[order(-ngrams_matrix$freq), ][1:20, ]
  ngrams_matrix$word <- factor(ngrams_matrix$word, as.character(ngrams_matrix$word))

  # Plot
  ggplot(ngrams_matrix, aes(x=word, y=freq)) +
    geom_bar(stat="Identity", fill="pink", colour="black") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("n-grams") + ylab("Frequency")
}

Plot of frequency distribution of 1-gram

n_grams_plot(n=1, data=sub_blogs_Corpus)

Plot of frequency distribution of 2-gram

n_grams_plot(n=2, data=sub_blogs_Corpus)

Plot of frequency distribution of 3-gram

n_grams_plot(n=3, data=sub_blogs_Corpus)

Plot of frequency distribution of 4-gram

n_grams_plot(n=4, data=sub_blogs_Corpus)

Create a corpus and derive the movie rating from the score given by the get_sentiment function

sent.value <- get_sentiment(post_tweets_data$text)

corpus_tw = Corpus(VectorSource(post_tweets_data$text))

corpus_tw = tm_map(corpus_tw, tolower)


## Warning in tm_map.SimpleCorpus(corpus_tw, tolower): transformation drops
## documents
corpus_tw = tm_map(corpus_tw, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus_tw, removePunctuation): transformation
## drops documents
corpus_tw = tm_map(corpus_tw, removeWords, c(stopwords("english")))
## Warning in tm_map.SimpleCorpus(corpus_tw, removeWords, c(stopwords("english"))):
## transformation drops documents
corpus_tw = tm_map(corpus_tw, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus_tw, stemDocument): transformation drops
## documents
frequencies_tw = DocumentTermMatrix(corpus_tw)

sparse_tw = removeSparseTerms(frequencies_tw, 0.995)

sparse_tw.df = as.data.frame(as.matrix(sparse_tw))

colnames(sparse_tw.df) = make.names(colnames(sparse_tw.df))
# An earlier labelling scheme, kept for reference:
# category_sentiment <- ifelse(sent.value < 0, "Bad",
#                       ifelse(sent.value >= 0 & sent.value < 0.5, "Average",
#                       ifelse(sent.value >= 0.5 & sent.value < 1, "Good",
#                       ifelse(sent.value >= 1 & sent.value < 1.5, "Very Good", "Excellent"))))
# sparse_tw.df$Polarity = category_sentiment

Classify the tweets into 5 categories based on the scores provided by the get_sentiment function

category_sentiment <- ifelse(sent.value < 0, 1,
                      ifelse(sent.value == 0, "Ignore",
                      ifelse(sent.value > 0 & sent.value < 0.5, 2,
                      ifelse(sent.value >= 0.5 & sent.value < 1, 3,
                      ifelse(sent.value >= 1 & sent.value < 1.5, 4, 5)))))

sparse_tw.df$Polarity = category_sentiment

table(sparse_tw.df$Polarity)
##
## 1 2 3 4 5 Ignore
## 8942 8203 17346 15614 24895 30853
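
The same five-way binning can be written more compactly with cut(); a sketch under the same thresholds (the helper name bin_polarity is ours, not part of the original pipeline):

bin_polarity <- function(score) {
  # [-Inf,0) -> 1, [0,0.5) -> 2, [0.5,1) -> 3, [1,1.5) -> 4, [1.5,Inf) -> 5
  out <- as.character(cut(score, breaks = c(-Inf, 0, 0.5, 1, 1.5, Inf),
                          labels = c("1", "2", "3", "4", "5"), right = FALSE))
  out[score == 0] <- "Ignore"   # scores of exactly zero are set aside, as above
  out
}
# table(bin_polarity(sent.value))   # should reproduce the counts shown above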

Remove the rows whose Polarity (Y) value is "Ignore"

sparse_tw_new.df <- filter(sparse_tw.df, Polarity != "Ignore")

table(sparse_tw_new.df$Polarity)

##
## 1 2 3 4 5
## 8942 8203 17346 15614 24895
ggplot(data=sparse_tw_new.df, aes(Polarity, fill=Polarity))+ geom_bar()

BUILD CLASSIFICATION MODELS AND PREDICT FOR POST RELEASE ANALYSIS

We will build different classification models and compare their accuracy and performance. Polarity
will be the dependent variable.
prop.table(table(sparse_tw_new.df$Polarity))
##
## 1 2 3 4 5
## 0.1192267 0.1093733 0.2312800 0.2081867 0.3319333
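
As a reference point for the accuracies reported below, the no-information rate (always predicting the most frequent class) follows directly from these proportions:

# Majority-class baseline accuracy, about 0.33 here; a useful model should beat this.
max(prop.table(table(sparse_tw_new.df$Polarity)))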

Extract a random sample of 5,000 observations from the DTM

#model_data <- sparse_tw_new.df %>% sample_frac(0.10)

model_data <- sample_n(sparse_tw_new.df, 5000)


dim(model_data)
## [1] 5000 461

Split the data into a 70:30 ratio for Train and Test

library(caTools)

set.seed(777)

model_data$Polarity <- as.factor(model_data$Polarity)

spl = sample.split(model_data$Polarity, SplitRatio = 0.7)

train_data = subset(model_data, spl == TRUE)
test_data = subset(model_data, spl == FALSE)

prop.table(table(train_data$Polarity))
##
## 1 2 3 4 5
## 0.1114286 0.1088571 0.2397143 0.2117143 0.3282857
prop.table(table(test_data$Polarity))
##
## 1 2 3 4 5
## 0.1113333 0.1086667 0.2400000 0.2120000 0.3280000

Build CART Model

# Load the Libraries


library(rpart)
library(rpart.plot)

movie_cart_model = rpart(Polarity ~ ., data=train_data, method="class")

#CART Diagram
prp(movie_cart_model, extra=2)

Predict and Evaluate the Performance of CART on train data

predict_cart_train_post = predict(movie_cart_model, data=train_data, type="class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_cart <- table(train_data$Polarity, predict_cart_train_post)
confusion_matrix_cart

## predict_cart_train_post
## 1 2 3 4 5
## 1 102 6 0 0 282
## 2 0 150 0 0 231
## 3 1 2 181 1 654
## 4 1 0 8 392 340
## 5 0 0 6 2 1141
# Baseline accuracy
accuracy_cart_train_post = sum(diag(confusion_matrix_cart))/sum(confusion_matrix_cart)
accuracy_cart_train_post
## [1] 0.5617143

Predict and Evaluate the Performance of CART on test data

predict_cart_test_post = predict(movie_cart_model, newdata=test_data, type="class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_cart <- table(test_data$Polarity, predict_cart_test_post)
confusion_matrix_cart
## predict_cart_test_post
## 1 2 3 4 5
## 1 33 5 0 2 127
## 2 0 68 0 0 95
## 3 2 0 72 0 286
## 4 0 2 4 170 142
## 5 0 0 4 3 485
# Baseline accuracy
accuracy_cart_test_post = sum(diag(confusion_matrix_cart))/sum(confusion_matrix_cart)
accuracy_cart_test_post
## [1] 0.552

AUC-ROC Curve for CART on Train and Test dataset

library(pROC)

#Train data - Plot ROC curve


roc_obj_cart_train_post <- multiclass.roc(as.numeric(train_data$Polarity), as.numeric(predict_cart_train_post), quiet = TRUE)
roc_obj_cart_train_post
##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_cart_train_post), quiet = TRUE)
##
## Data: as.numeric(predict_cart_train_post) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.6012
#Test data - Plot ROC curve
roc_obj_cart_test_post <- multiclass.roc(as.numeric(test_data$Polarity), as.numeric(predict_cart_test_post), quiet = TRUE)
roc_obj_cart_test_post
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =

as.numeric(predict_cart_test_post), quiet = TRUE)
##
## Data: as.numeric(predict_cart_test_post) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.5985

Comparison of all the performance measures of the CART Model on the Train and Test datasets

results_cart_train_post = data.frame(accuracy_cart_train_post,
as.numeric(roc_obj_cart_train_post$auc))
names(results_cart_train_post) = c("ACCURACY", "AUC-ROC" )

results_cart_test_post =
data.frame(accuracy_cart_test_post,as.numeric(roc_obj_cart_test_post$auc) )
names(results_cart_test_post) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_cart_train_post, results_cart_test_post)


row.names(df_fin) = c('CART Train Post', 'CART Test Post')
df_fin
## ACCURACY AUC-ROC
## CART Train Post 0.5617143 0.6011670
## CART Test Post 0.5520000 0.5984982

Build Random Forest Model

# Load Library
library(randomForest)

set.seed(777)

movie_rf_model = randomForest(Polarity ~ ., data=train_data,importance=TRUE)


movie_rf_model
##
## Call:
## randomForest(formula = Polarity ~ ., data = train_data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 21
##
## OOB estimate of error rate: 10.06%
## Confusion matrix:
## 1 2 3 4 5 class.error
## 1 302 4 60 4 20 0.22564103
## 2 13 311 37 6 14 0.18372703
## 3 11 6 782 9 31 0.06793802
## 4 5 2 39 654 41 0.11740891
## 5 10 1 26 13 1099 0.04351610

Predict and Evaluate the Performance of Random Forest on train data

# Make predictions:
# (predict.randomForest has no `data` argument; since `newdata` is not supplied, these are the
#  out-of-bag predictions on the training set, which is why the confusion matrix below matches
#  the OOB confusion matrix printed above)
predict_rf_train_post = predict(movie_rf_model, data=train_data,type="response")

# Evaluate the performance - Confusion matrix :


confusion_matrix_rf <- table(train_data$Polarity, predict_rf_train_post)
confusion_matrix_rf

## predict_rf_train_post
## 1 2 3 4 5
## 1 302 4 60 4 20
## 2 13 311 37 6 14
## 3 11 6 782 9 31
## 4 5 2 39 654 41
## 5 10 1 26 13 1099
# Baseline accuracy:
accuracy_rf_train_post = sum(diag(confusion_matrix_rf))/sum(confusion_matrix_rf)
accuracy_rf_train_post
## [1] 0.8994286

Predict and Evaluate the Performance of Random Forest on test data

# Make predictions:
predict_rf_test_post = predict(movie_rf_model, newdata=test_data,type="response")

# Evaluate the performance - Confusion matrix :


confusion_matrix_rf <- table(test_data$Polarity, predict_rf_test_post)
confusion_matrix_rf
## predict_rf_test_post
## 1 2 3 4 5
## 1 129 3 20 4 11
## 2 4 140 11 1 7
## 3 10 0 342 1 7
## 4 0 1 13 292 12
## 5 2 1 11 6 472
# Baseline accuracy:
accuracy_rf_test_post = sum(diag(confusion_matrix_rf))/sum(confusion_matrix_rf)
accuracy_rf_test_post
## [1] 0.9166667

Variable Importance of Random Forest

#Variable importance:
#varImpPlot(movie_rf_model,main='Variable Importance Plot: Movie ',type=2)
varImpPlot(movie_rf_model,sort=TRUE,type=NULL, class=NULL, scale=TRUE,cex=.8)
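
As a tabular companion to the plot, the scores behind varImpPlot() can be inspected directly; a short sketch using randomForest::importance() on the fitted model:

imp <- as.data.frame(importance(movie_rf_model))

# Ten strongest predictors by mean decrease in Gini impurity
head(imp[order(-imp$MeanDecreaseGini), "MeanDecreaseGini", drop = FALSE], 10)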

AUC-ROC Curve for Random Forest on Train and Test dataset

#Train data - Plot ROC curve


roc_obj_rf_train_post <- multiclass.roc(as.numeric(train_data$Polarity), as.numeric(predict_rf_train_post), quiet = TRUE)
roc_obj_rf_train_post
##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_rf_train_post), quiet = TRUE)
##
## Data: as.numeric(predict_rf_train_post) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.9174
#Test data - Plot ROC curve
# (note: the call below uses predict_cart_test_post, which looks like a copy-paste slip;
#  predict_rf_test_post was presumably intended, which is why the test AUC reported here,
#  0.5985, is identical to the CART test AUC rather than reflecting the Random Forest predictions)
roc_obj_rf_test_post <- multiclass.roc(as.numeric(test_data$Polarity), as.numeric(predict_cart_test_post), quiet = TRUE)
roc_obj_rf_test_post
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_cart_test_post), quiet = TRUE)
##
## Data: as.numeric(predict_cart_test_post) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.5985

Comparison of all the performance measures of the Random Forest Model on the Train and Test datasets

# (note: these two data frames reuse accuracy_rf_train_pre / accuracy_rf_test_pre from the
#  pre-release section, another likely copy-paste slip; accuracy_rf_train_post (0.8994) and
#  accuracy_rf_test_post (0.9167) computed above were presumably intended, which is why the
#  table below repeats the pre-release accuracies)
results_rf_train_post = data.frame(accuracy_rf_train_pre, as.numeric(roc_obj_rf_train_post$auc))
names(results_rf_train_post) = c("ACCURACY", "AUC-ROC")

results_rf_test_post = data.frame(accuracy_rf_test_pre, as.numeric(roc_obj_rf_test_post$auc))
names(results_rf_test_post) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_rf_train_post, results_rf_test_post)


row.names(df_fin) = c('Random Forest Train Post', 'Random Forest Test Post')
df_fin
## ACCURACY AUC-ROC
## Random Forest Train Post 0.9354286 0.9173600
## Random Forest Test Post 0.9326667 0.5984982
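
For reference, substituting the post-release accuracies computed above would give Random Forest Train Post an accuracy of about 0.90 (AUC 0.92) and Random Forest Test Post an accuracy of about 0.92; the 0.60 test AUC would also change once the Random Forest test predictions, rather than the CART ones, are used.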

Build SVM Model

set.seed(123)
library(e1071)

movie_svm_model = svm(Polarity ~ . , data = train_data)


## Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
## Variable(s) 'format' and 'backstab' and 'thackrey' and 'alluarjun' and 'boestim'
## and 'earlytrend' constant. Cannot scale data.

Predict and Evaluate the performance of SVM Model on train data

# Make predictions:
predict_svm_train_post = predict(movie_svm_model, data=train_data, decision.values=TRUE)

# Evaluate the performance - Confusion matrix :


confusion_matrix_svm <- table(train_data$Polarity, predict_svm_train_post)
confusion_matrix_svm
## predict_svm_train_post
## 1 2 3 4 5
## 1 120 0 209 2 59
## 2 0 185 117 3 76
## 3 0 0 763 1 75
## 4 0 0 189 401 151
## 5 0 0 82 7 1060
# Baseline accuracy:
accuracy_svm_train_post = sum(diag(confusion_matrix_svm))/sum(confusion_matrix_svm)
accuracy_svm_train_post
## [1] 0.7225714

Predict and Evaluate the performance of SVM Model on test data

# Make predictions:
predict_svm_test_post = predict(movie_svm_model, newdata=test_data,decision.values=TRUE)

# Evaluate the performance - Confusion matrix :


confusion_matrix_svm <- table(test_data$Polarity, predict_svm_test_post)
confusion_matrix_svm
## predict_svm_test_post
## 1 2 3 4 5
## 1 32 0 99 4 32
## 2 0 83 45 3 32
## 3 0 0 327 0 33
## 4 0 0 74 177 67
## 5 0 0 47 0 445

# Baseline accuracy:
accuracy_svm_test_post = sum(diag(confusion_matrix_svm))/sum(confusion_matrix_svm)
accuracy_svm_test_post
## [1] 0.7093333

AUC-ROC Curve for SVM model on Train and Test dataset

library(pROC)

#Train data - Plot ROC curve


roc_obj_svm_train_post <- multiclass.roc(as.numeric(train_data$Polarity), as.numeric(predict_svm_train_post), quiet = TRUE)
roc_obj_svm_train_post
##
## Call:
## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_svm_train_post), quiet = TRUE)
##
## Data: as.numeric(predict_svm_train_post) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.7701
#Test data - Plot ROC curve
roc_obj_svm_test_post <- multiclass.roc(as.numeric(test_data$Polarity), as.numeric(predict_svm_test_post), quiet = TRUE)
roc_obj_svm_test_post
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_svm_test_post), quiet = TRUE)
##
## Data: as.numeric(predict_svm_test_post) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.7583

Comparison of all the performance measures of the SVM Model on the Train and Test datasets

results_svm_train_post = data.frame(accuracy_svm_train_post,
as.numeric(roc_obj_svm_train_post$auc))
names(results_svm_train_post) = c("ACCURACY", "AUC-ROC" )

results_svm_test_post =
data.frame(accuracy_svm_test_post,as.numeric(roc_obj_svm_test_post$auc) )
names(results_svm_test_post) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_svm_train_post, results_svm_test_post)


row.names(df_fin) = c('SVM Train Post', 'SVM Test Post')
df_fin
## ACCURACY AUC-ROC
## SVM Train Post 0.7225714 0.7700893
## SVM Test Post 0.7093333 0.7583435

Build Naive Bayes Model

set.seed(777)

movie_nb_model = naiveBayes(Polarity ~ . , usekernel=T, data = train_data)

Predict and Evaluate the performance of NB Model on train data


# Make predictions:
predict_nb_train_post = predict(movie_nb_model, train_data, type = "class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_nb <- table(train_data$Polarity, predict_nb_train_post)
confusion_matrix_nb
## predict_nb_train_post
## 1 2 3 4 5
## 1 267 119 0 4 0
## 2 39 341 0 1 0
## 3 148 550 117 24 0
## 4 131 461 0 149 0
## 5 612 293 13 22 209
# Baseline accuracy:
accuracy_nb_train_post = sum(diag(confusion_matrix_nb))/sum(confusion_matrix_nb)
accuracy_nb_train_post
## [1] 0.3094286

Predict and Evaluate the performance of NB Model on test data

# Make predictions:
predict_nb_test_post = predict(movie_nb_model,newdata = test_data, type = "class")

# Evaluate the performance - Confusion matrix :


confusion_matrix_nb <- table(test_data$Polarity, predict_nb_test_post)
confusion_matrix_nb
## predict_nb_test_post
## 1 2 3 4 5
## 1 98 63 0 6 0
## 2 14 144 0 5 0
## 3 57 234 58 11 0
## 4 54 196 1 67 0
## 5 255 131 7 10 89
# Baseline accuracy:
accuracy_nb_test_post = sum(diag(confusion_matrix_nb))/sum(confusion_matrix_nb)
accuracy_nb_test_post
## [1] 0.304

AUC-ROC Curve for Naive Bayes model on Train and Test dataset

library(pROC)
#library(ROCR)
#Train data - Plot ROC curve
roc_obj_nb_train_post <- multiclass.roc(as.numeric(train_data$Polarity), as.numeric(predict_nb_train_post), quiet = TRUE)
roc_obj_nb_train_post
##
## Call:

## multiclass.roc.default(response = as.numeric(train_data$Polarity), predictor =
as.numeric(predict_nb_train_post), quiet = TRUE)
##
## Data: as.numeric(predict_nb_train_post) with 5 levels of
as.numeric(train_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.6423
#Test data - Plot ROC curve
roc_obj_nb_test_post <- multiclass.roc(as.numeric(test_data$Polarity), as.numeric(predict_nb_test_post), quiet = TRUE)
roc_obj_nb_test_post
##
## Call:
## multiclass.roc.default(response = as.numeric(test_data$Polarity), predictor =
as.numeric(predict_nb_test_post), quiet = TRUE)
##
## Data: as.numeric(predict_nb_test_post) with 5 levels of
as.numeric(test_data$Polarity): 1, 2, 3, 4, 5.
## Multi-class area under the curve: 0.6245

Comparison of all the performance measures of the Naive Bayes Model on the Train and Test datasets

results_nb_train_post = data.frame(accuracy_nb_train_post,
as.numeric(roc_obj_nb_train_post$auc))
names(results_nb_train_post) = c("ACCURACY", "AUC-ROC" )

results_nb_test_post =
data.frame(accuracy_nb_test_post,as.numeric(roc_obj_nb_test_post$auc) )
names(results_nb_test_post) = c("ACCURACY", "AUC-ROC")

df_fin =rbind(results_nb_train_post, results_nb_test_post)


row.names(df_fin) = c('Naive Bayes Train Post', 'Naive Bayes Test Post')
df_fin
## ACCURACY AUC-ROC
## Naive Bayes Train Post 0.3094286 0.6422574
## Naive Bayes Test Post 0.3040000 0.6245372

Comparing all four models - CART, Random Forest, SVM and Naive Bayes - on the post-release data, by Accuracy and AUC-ROC

df_fin = rbind(results_cart_train_post, results_cart_test_post, results_rf_train_post, results_rf_test_post,
               results_svm_train_post, results_svm_test_post, results_nb_train_post, results_nb_test_post)

row.names(df_fin) = c('CART Train Post', 'CART Test Post', 'Random Forest Train Post', 'Random Forest Test Post',
                      'SVM Train Post', 'SVM Test Post', 'Naive Bayes Train Post', 'Naive Bayes Test Post')

#install.packages("kableExtra")
library(kableExtra)
print("Model Performance Comparison Metrics ")
## [1] "Model Performance Comparison Metrics "
kable(round(df_fin,2)) %>%
kable_styling(c("striped","bordered"))
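
For reference, the comparison table rendered by kable above contains the following values, rounded to two decimals (the Random Forest accuracy rows carry the pre-release values, as noted earlier):

                            ACCURACY   AUC-ROC
CART Train Post                 0.56      0.60
CART Test Post                  0.55      0.60
Random Forest Train Post        0.94      0.92
Random Forest Test Post         0.93      0.60
SVM Train Post                  0.72      0.77
SVM Test Post                   0.71      0.76
Naive Bayes Train Post          0.31      0.64
Naive Bayes Test Post           0.30      0.62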
