WORKBOOK FOR
CS – 368
SECTION II
Practical Course on DATA ANALYTICS
Name of Student:
Total Marks : / 15
Converted Marks : /
Signature of Incharge
Date:
Regression Analysis :
• Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
• Regression analysis helps us understand how the value of the dependent variable changes with respect to an independent variable while the other independent variables are held fixed.
• Regression is a supervised learning technique that helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables.
• It is mainly used for prediction, forecasting, time-series modeling, and determining the cause-and-effect relationship between variables.
• In regression, we fit a line or curve that best fits the given data points; using this fit, the machine learning model can make predictions about the data.
• "Regression shows a line or curve that passes through the data points on the target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum."
• The distance between the data points and the line tells whether the model has captured a strong relationship or not.
Types of Regression
There are various types of regression used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all the regression methods analyze the effect of the independent variables on the dependent variable. In this assignment we will study Linear Regression and Logistic Regression in detail.
Linear Regression:
• Linear regression is a statistical regression method which is used for predictive analysis.
• It is one of the simplest and easiest algorithms; it works on regression and shows the relationship between continuous variables.
• It is used for solving regression problems in machine learning.
• Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence the name linear regression.
• If there is only one input variable (x), then such linear regression is called simple linear regression. If there is more than one input variable, it is called multiple linear regression.
• The relationship between the variables in a linear regression model can be pictured as a straight line fitted through the data points, for example when predicting the salary of an employee on the basis of years of experience (see the sketch below).
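A minimal sketch of such a simple linear regression in Python (the salary/experience numbers below are purely illustrative):

# Simple linear regression sketch: predicting salary from years of experience (illustrative data)
import numpy as np
from sklearn.linear_model import LinearRegression

years_experience = np.array([[1], [2], [3], [4], [5], [6]])      # independent variable (X)
salary = np.array([30000, 35000, 41000, 46000, 52000, 58000])    # dependent variable (y)

regressor = LinearRegression()
regressor.fit(years_experience, salary)                          # fit the best-fit line

print("Increase in salary per year of experience :", regressor.coef_[0])
print("Predicted salary for 7 years of experience :", regressor.predict([[7]])[0])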
Logistic Regression:
• Logistic regression is another supervised learning algorithm which is used to solve classification problems. In classification problems, the dependent variable is in a binary or discrete format such as 0 or 1.
• The logistic regression algorithm works with categorical outcomes such as 0 or 1, Yes or No, True or False, Spam or Not spam, etc.
• It is a predictive analysis algorithm which works on the concept of probability.
• Logistic regression is a type of regression, but it differs from linear regression in how it is used.
• Logistic regression uses the sigmoid (logistic) function to model the data. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
When we provide the input values (data) to the function, it gives an S-shaped curve as follows:
• It uses the concept of a threshold level: values above the threshold are rounded up to 1, and values below the threshold are rounded down to 0 (see the sketch below).
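A short sketch of the sigmoid curve and the threshold idea (the 0.5 threshold below is the commonly used value, assumed here for illustration):

# Plot the sigmoid (logistic) function and apply a threshold of 0.5
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 200)
sigmoid = 1 / (1 + np.exp(-x))                    # f(x) = 1 / (1 + e^(-x))

plt.plot(x, sigmoid)
plt.axhline(0.5, color="red", linestyle="--")     # threshold level
plt.title("Sigmoid (S-curve)")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()

predicted_class = (sigmoid >= 0.5).astype(int)    # values above the threshold become 1, below become 0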
Self-Activity
6. Residual analysis (check the results of model fitting to know whether the model is satisfactory)
plt.scatter(X_test,y_test,color="green") # Plot a graph with X_test vs y_test
plt.plot(X_train,regressor.predict(X_train),color="red",linewidth=3) # Plot the regression line
plt.title('Regression (Test Set)')
plt.xlabel('HP')
plt.ylabel('MSRP')
plt.show()
Here we plot a scatter plot of the X_test and y_test datasets and draw the regression line.
plt.scatter(X_train,y_train,color="blue") # Plot a graph with X_train vs y_train
plt.plot(X_train,regressor.predict(X_train),color="red",linewidth=3) # Plot the regression line
plt.title('Regression (Training Set)')
plt.xlabel('HP')
plt.ylabel('MSRP')
plt.show()
Here we plot the X_train vs. y_train scatter plot with the best-fit regression line, which makes the fitted relationship easy to see.
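To examine the residuals directly, a residual plot can also be sketched (this assumes regressor, X_test and y_test from the snippets above are already defined):

# Residual plot sketch (assumes regressor, X_test, y_test from the snippets above)
predictions = regressor.predict(X_test)
residuals = y_test - predictions
plt.scatter(predictions, residuals, color="purple")
plt.axhline(0, color="black", linewidth=1)   # residuals should scatter randomly around zero
plt.title("Residuals (Test Set)")
plt.xlabel("Predicted MSRP")
plt.ylabel("Residual")
plt.show()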
Sample Example -
The goal is to build a logistic regression model in Python in order to determine whether candidates would get admitted to a prestigious university.
Here, there are two possible outcomes: Admitted (represented by the value of ‘1’) vs. Rejected (represented
by the value of ‘0’).
You can then build a logistic regression in Python, where:
• The dependent variable represents whether a person gets admitted; and
• The 3 independent variables are the GMAT score, GPA and Years of work experience
2. Reading and understanding the data (apply appropriate transformations where needed: cleaning, filling nulls, removing duplicates, etc.)
data = pd.read_csv(r"C:\TYBSC\Student_Score.csv") # dataset (raw string avoids backslash escapes in the path)
logistic_regression.fit(x_train,y_train)
y_pred=logistic_regression.predict(x_test)
6. Print the test data and the predicted data (predictions on the test set).
Diving deeper into the results, print two components in the Python code:
print (x_test)
print (y_pred)
Recall that our original dataset (from step 1) had 40 observations. Since we set the test size to 0.25, the confusion matrix displays the results for 10 records (= 40 × 0.25). These are the 10 test records:
The prediction was also made for those 10 records (where 1 = admitted, while 0 = rejected):
Comparing with the actual dataset (from step 1), you’ll see that for the test data we got the correct results 8 out of 10 times:
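To quantify this, the confusion matrix and accuracy can be computed and visualized, for example as in the following sketch (it assumes y_test and y_pred from the steps above):

# Confusion matrix and accuracy sketch (assumes y_test and y_pred from above)
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print("Confusion matrix :\n", confusion_matrix)
print("Accuracy :", metrics.accuracy_score(y_test, y_pred))   # e.g. 0.8 for 8 correct out of 10

sns.heatmap(confusion_matrix, annot=True, fmt="d")            # optional heatmap of the matrix
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()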
SET B
1. Build a simple linear regression model for Fish Species Weight Prediction. (download dataset
https://www.kaggle.com/aungpyaeap/fish-market?select=Fish.csv )
2. Use the iris dataset. Write a Python program to view some basic statistical details like percentile, mean, std, etc. of the species 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica'. Apply logistic regression on the dataset to identify the different species (setosa, versicolor, virginica) of Iris flowers given just 4 features: sepal and petal lengths and widths. Find the accuracy of the model.
Assignment Evaluation
Objectives
● To understand the impact of finding frequent patterns from large datasets.
● To learn the Apriori Algorithm which is used for frequent itemsets mining.
● To understand Association Rule Mining.
● To write and learn implementation of such concepts with Python.
Reading
You should read the following topics before starting this exercise:
● Why pre-processing is a must before analysing data.
● What are support, confidence and lift.
● Learn definitions such as frequent itemsets, association between things, Apriori
Property of sets.
● Basic understanding of libraries supported in Python for performing these tasks.
Ready Reference
Frequent Itemset Mining: Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction databases, relational databases, and
other information repositories.
Association mining searches for frequent items in the dataset. Frequent itemset mining finds the interesting associations and correlations between itemsets in transactional and relational databases.
If two items X and Y are purchased together frequently, then it is good to put them together in stores or to offer a discount on one item when the other is purchased. This can really increase sales. For example, it is likely that if a customer buys milk and bread, he/she also buys butter. So the association rule is {milk, bread} => {butter}.
Applications: Market basket analysis is one of the key techniques used by large retailers to uncover associations between items. Related applications include catalog design, loss-leader analysis, clustering, classification, recommendation systems, etc.
Apriori is an algorithm for frequent item set mining and association rule learning over
relational databases. The name of the algorithm is based on the fact that the algorithm uses
prior knowledge of frequent itemset properties. Apriori employs an iterative approach known
as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
To construct association rules between items, the algorithm considers three important factors: support, confidence and lift. Each of these factors is explained as follows:
Support:
The support of item I is defined as the ratio of the number of transactions containing item I to the total number of transactions, expressed as:
Support(I) = (number of transactions containing I) / (total number of transactions)
Confidence:
This is measured by the proportion of transactions containing item I1 in which item I2 also appears, i.e. given that the item on the left-hand side (antecedent) is purchased, the item on the right-hand side (consequent) is also purchased:
Confidence(I1 => I2) = (number of transactions containing both I1 and I2) / (number of transactions containing I1)
Lift:
Lift is the ratio of the confidence of the rule to the support of the consequent, expressed as:
Lift(antecedent => consequent) = Confidence(antecedent => consequent) / Support(consequent)
Lift(antecedent => consequent) = 1 means that there is no correlation within the itemset; > 1 means that there is a positive correlation within the itemset, i.e. the products in the itemset (antecedent and consequent) are more likely to be bought together; < 1 means that there is a negative correlation within the itemset, i.e. the products in the itemset (antecedent and consequent) are unlikely to be bought together.
1. Define the minimum support and confidence for the association rule
2. Take all the subsets in the transactions with higher support than the minimum support
3. Take all the rules of these subsets with higher confidence than minimum confidence
4. Sort the association rules in the decreasing order of lift.
5. Visualize the rules along with confidence and support.
In this assignment you will analyze collections of market baskets and will determine frequent
itemsets and association rules present in the collections.
Python libraries
Python has many libraries for apriori implementation.
i. Mlxtend (apriori)
ii. Apyori (apriori)
iii. pypi (efficient_apriori)
The apriori module from the mlxtend library provides a fast and efficient apriori implementation.
Usage:
apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0,
low_memory=False)
Parameters
• df : One-Hot-Encoded DataFrame or DataFrame that has 0 and 1 or True and False as
values
• min_support : Floating point value between 0 and 1 that indicates the minimum support
required for an itemset to be selected.
(# of observations with the item) / (total # of observations)
• use_colnames : Preserves the column names for the itemsets, making them more readable.
• max_len : Maximum length of the itemsets generated. If not set, all possible lengths are evaluated.
• verbose : Shows the number of iterations if >= 1 and low_memory is True. If = 1 and low_memory is False, shows the number of combinations.
• low_memory : If True, uses an iterator to search for combinations above min_support. low_memory=True should only be used for large datasets when memory resources are limited, because this implementation is approximately 3–6x slower than the default.
The function returns a pandas DataFrame with columns ['support', 'itemsets'] containing all itemsets with support >= min_support and length <= max_len (if max_len is not None).
Leverage computes the difference between the observed frequency of A and C appearing together and the frequency that would be expected if A and C were independent:
Leverage(A => C) = Support(A => C) − Support(A) × Support(C)
A leverage value of 0 indicates independence.
Conviction measures how strongly the consequent depends on the antecedent; a high conviction value means that the consequent is highly dependent on the antecedent:
Conviction(A => C) = (1 − Support(C)) / (1 − Confidence(A => C))
Self-Activity
The dataset contains a set of transactions, each consisting of text items. This needs to be converted into numerical form before it can be analyzed. The encoding process converts the textual labels into a machine-readable (one-hot encoded) form. We can transform it into the right format via the TransactionEncoder as follows:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# transactions is a list of transactions, each given as a list of item names
te=TransactionEncoder()
te_array=te.fit(transactions).transform(transactions)
df=pd.DataFrame(te_array, columns=te.columns_)
df
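With the one-hot encoded DataFrame df, the frequent itemsets and association rules can then be generated along the following lines (the min_support and min_threshold values below are illustrative, not prescribed):

# Frequent itemsets and association rules with mlxtend (threshold values are illustrative)
from mlxtend.frequent_patterns import apriori, association_rules

frequent_itemsets = apriori(df, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)
rules = rules.sort_values(by="lift", ascending=False)   # sort the rules in decreasing order of lift
print(frequent_itemsets.head())
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]].head())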
Dataset Sources
https://www.kaggle.com/datasets/sivaram1987/association-rule-learningapriori
https://github.com/shivang98/Market-Basket-Optimization
https://www.kaggle.com/datasets/hemanthkumar05/market-basket-optimization
https://www.kaggle.com/datasets/irfanasrullah/groceries
Lab Assignments
SET A:
1. Create the following dataset in python
Apply the apriori algorithm on the above dataset to generate the frequent itemsets and
association rules. Repeat the process with different min_sup values.
2. Create your own transactions dataset and apply the above process on your dataset.
SET B:
SET C:
Write a python code to implement the apriori algorithm. Test the code on any standard dataset.
Assignment Evaluation
Objectives
• To understand the concept of sentiment analysis.
• To learn various methodologies for analysis on text including text analytics, tokenization, frequency
distribution, stopwords, stemming, lemmatization, part-of-speech tagging.
• To write Python scripts using various libraries for sentiment analysis with the natural language processing toolkit, classifying emotions on the basis of labels, i.e. Positive, Negative and Neutral, and to use the wordcloud package for word comparison.
• To perform analysis on social media data such as Facebook, Twitter, YouTube.
• To graphically represent the analyzed data.
Reading
You should read the following topics before starting the exercise :
The need for data analysis using natural language processing, and the basics of Python libraries such as pandas, matplotlib, numpy, scikit-learn, nltk and the VADER tool used to perform the data analysis.
Ready Reference
Python Libraries for performing text and Sentiment Analysis :
Natural Language Toolkit (NLTK) :
NLTK is a Python package for performing Natural Language Processing on human language data, which is mostly unstructured. It mainly focuses on analyzing textual data. It supports different natural language processing steps such as tokenization, frequency distribution, stopword removal, lexicon normalization (stemming and lemmatization) and POS tagging. These are considered pre-processing steps for text analytics.
Installation of NLTK : You can use any IDE to perform Python programming for the following tasks; here the Spyder IDE is used. Install the package (for example with pip install nltk), then run nltk.download(). A window will appear from which the packages can be downloaded; click on Download to fetch all the supporting NLTK packages.
You can also download all NLTK packages using Python statement :
nltk.download(‘all’)
If all the packages are not needed, then individual packages can also be installed by passing its name in
nltk.download().
Syntax : nltk.download(‘package_name’)
Tokenization : It is the first step in text analytics. Tokenization means breaking down a textual paragraph into small chunks such as words or sentences. It is classified into two types:
Sentence Tokenization and Word Tokenization : Sentence tokenization breaks the text into sentences, whereas word tokenization breaks the text into words.
Example :
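A sketch of tokenization code that produces output of the form shown below (the paragraph text is the same one reused in the stopword example later in this section):

# Sentence and word tokenization sketch
from nltk.tokenize import sent_tokenize, word_tokenize

paragraph_text="""Hello all, Welcome to Python Programming Academy. Python Programming Academy is a nice platform to learn new programming skills. It is difficult to get enrolled in this Academy."""

tokenized_sentences=sent_tokenize(paragraph_text)   # break the text into sentences
tokenized_words=word_tokenize(paragraph_text)       # break the text into words

print("Tokenized Sentences :\n",tokenized_sentences,"\n")
print("Tokenized Words :\n",tokenized_words,"\n")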
Output :
Tokenized Sentences :
['Hello all, Welcome to Python Programming Academy.', 'Python Programming Academy is a nice platform to learn new programming skills.', 'It is difficult to get enrolled in this Academy.']
Tokenized Words :
['Hello', 'all', ',', 'Welcome', 'to', 'Python', 'Programming', 'Academy', '.', 'Python', 'Programming', 'Academy', 'is', 'a', 'nice', 'platform', 'to', 'learn', 'new', 'programming', 'skills', '.', 'It', 'is', 'difficult', 'to', 'get', 'enrolled', 'in', 'this', 'Academy', '.']
Frequency Distribution :
The frequency distribution helps to understand how many times each word occurs in the given textual data.
Example :
# Import word_tokenize
from nltk.tokenize import word_tokenize
# Import FreqDist package belonging to nltk.probability
from nltk.probability import FreqDist
# Textual data for word tokenization
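A plausible continuation of the snippet above, reusing the same paragraph as in the tokenization example and the frequency_distribution variable referenced below (the exact counts in the output depend on the paragraph used):

paragraph_text="""Hello all, Welcome to Python Programming Academy. Python Programming Academy is a nice platform to learn new programming skills. It is difficult to get enrolled in this Academy."""
tokenized_words=word_tokenize(paragraph_text)
frequency_distribution=FreqDist(tokenized_words)   # counts how often each token occurs
print(frequency_distribution)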
Output :
<FreqDist with 24 samples and 32 outcomes>
To find most common words using Frequency Distribution, add the following lines in above code :
print(frequency_distribution.most_common(2))
Output :
Stopwords : Stopwords are considered noise in textual data. For example, if the text contains words such as is, are, am, a, this, the, an, etc., they are treated as stopwords.
These stopwords need to be removed from the actual text before further processing. Using NLTK, first identify and create a list of the stopwords in the given text, then remove them from the original content. Before working with stopwords, make sure to download them as follows :
import nltk
nltk.download('stopwords')
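The predefined English stopwords can then be listed, for example with a sketch like this:

from nltk.corpus import stopwords
print(set(stopwords.words("english")))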
Output :
{'wouldn', 'down', 'was', 'any', 'themselves', 'on', 'how', 'y', 'them', 'do',
'as', "couldn't", 'wasn', 'can', 'yourself', "mightn't", 'm', "wasn't", 'yours',
"haven't", 'have', 'their', 'from', 'with', 'through', 'been', 'couldn', 'here',
'your', 'above', 'same', 'ours', 'now', 'isn', 'that', 'just', 'further',
'only', "won't", 'having', 'these', 'won', 'himself', 'ourselves', 'which',
"you're", 'while', 'of', "doesn't", "should've", "mustn't", 'hadn', 'are',
'not', 'he', 'she', 'am', 'an', 'most', 'whom', 'where', 'than', 'didn',
"isn't", 'shouldn', 'what', 'mustn', 'some', 'very', 'should', 'ain', "you'd",
'yourselves', 'own', 'but', 'we', 't', 'out', 'such', 'in', 've', 'this',
'shan', 'about', 'over', 'both', 'all', 'why', 'i', 'being', "wouldn't", 'll',
'myself', 'between', 'has', "didn't", 'hers', 'hasn', "she's", 'other', 'if',
'itself', 'below', "aren't", 'too', 'under', 'herself', 'be', 'after', 'off',
're', 'during', 'until', 'our', "shouldn't", 'into', 'don', 'again', 'nor',
'needn', "that'll", "weren't", 'no', 'so', 'then', 'before', 'his', 'its',
'few', 'doing', "don't", "you'll", "hadn't", 'because', 'there', 'did', 'my',
"needn't", "it's", 'they', 'for', 'does', 'is', 'a', 'against', 'who', 'and',
"shan't", 'o', 'weren', 'him', 'or', 'theirs', 'were', 'had', 'doesn', 'you',
'haven', 'those', 'me', 'when', 's', 'd', 'it', 'up', 'by', 'each', 'once',
'aren', "you've", 'her', "hasn't", 'to', 'more', 'will', 'mightn', 'the', 'at',
'ma'}
Removing Stopwords :
The words in the output above are the predefined stopwords of the English language. If any of these words occur in user-defined textual data, they can be removed as follows:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# Textual data to remove stopwords
paragraph_text="""Hello all, Welcome to Python Programming Academy. Python
Programming Academy is a nice platform to learn new programming skills. It is
difficult to get enrolled in this Academy."""
# Word Tokenization
tokenized_words=word_tokenize(paragraph_text)
# It will find the stopwords in the English language.
stop_words_data=set(stopwords.words("english"))
# Build the list of words remaining after filtering out the stopwords
filtered_words_list=[]
for words in tokenized_words:
    if words not in stop_words_data:
        filtered_words_list.append(words)
print("Tokenized Words : \n",tokenized_words,"\n")
print("Filtered Words : \n",filtered_words_list,"\n")
Output :
Tokenized Words :
Filtered Words :
Stemming : Stemming is a process of linguistic normalization that reduces words to their word root by removing derivational affixes. For example : writing, wrote, written can be stemmed or reduced to write.
Example :
# Same code as the previous example to remove stop words from tokenized words
from nltk.stem import PorterStemmer
porter_stemmer=PorterStemmer()
stemmed_text_words=[]
for words in filtered_words_list:
    stemmed_text_words.append(porter_stemmer.stem(words))
print("Filtered Words : \n",filtered_words_list,"\n")
print("Stemmed Words : \n",stemmed_text_words,"\n")
Lemmatization : Lemmatization is a process of reducing words to their base words (lemmas), which are linguistically correct forms. For example : the word “Running” will be lemmatized to “run”. Before that, download the package “wordnet” belonging to nltk as follows :
import nltk
nltk.download('wordnet')
# Lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
word_text="running"
print("Lemmatized Word : ",lemmatizer.lemmatize(word_text,"v"))
Output :
Lemmatized Word : run
POS Tagging : The POS (Part-of-Speech) tagging is basically used to identify the grammatical category of the given words, i.e. noun, pronoun, verb, adjective, adverb, etc., on the basis of their context.
Before that, download the package “averaged_perceptron_tagger” belonging to nltk as follows :
import nltk
nltk.download('averaged_perceptron_tagger')
# Part-of-Speech Tagging
import nltk
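A sketch that produces tagging output of the form shown below (the input sentence is inferred from the tagged words):

from nltk.tokenize import word_tokenize
pos_text="Hello all, Welcome to Python programming"
pos_tokens=word_tokenize(pos_text)
print(nltk.pos_tag(pos_tokens))   # each word is paired with its part-of-speech tag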
Output :
[('Hello', 'NNP'), ('all', 'DT'), (',', ','), ('Welcome', 'NNP'), ('to', 'TO'),
('Python', 'NNP'), ('programming', 'NN')]
Text Summarization :
Text summarization is an NLP technique that condenses a large amount of text. It is the process of identifying the most important, meaningful information in a document and compressing it into a shorter version while preserving its meaning. Types: Extractive summarization and Abstractive summarization.
To perform extractive summarization, we calculate the sentence weights and choose the first ‘n’ sentences with maximum weight. The weights are calculated on the basis of the word frequencies.
Steps:
1. Preprocess the text
2. Create the word frequency table
3. Tokenize the sentence
4. Score the sentences: Term frequency
5. Generate the summary
Sample code
import nltk
nltk.download('all')
#Preprocessing
import re
text="""
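The rest of the sample can be sketched along the five steps above (the paragraph text, variable names and the number of summary sentences here are illustrative):

# Extractive summarization sketch following steps 1–5 (illustrative text and parameters)
import re
import heapq
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

text="""Natural language processing makes it possible to analyze large volumes of text. Summarization selects the most important sentences from a document. It preserves the meaning of the document while making it shorter."""

# 1. Preprocess : remove special characters and digits
clean_text=re.sub(r'[^A-Za-z. ]',' ',text)

# 2. Create the word frequency table (ignoring stopwords)
stop_words=set(stopwords.words("english"))
word_frequencies={}
for word in word_tokenize(clean_text.lower()):
    if word not in stop_words and word!='.':
        word_frequencies[word]=word_frequencies.get(word,0)+1

# 3. Tokenize the sentences and 4. score them by term frequency
sentence_scores={}
for sentence in sent_tokenize(text):
    for word in word_tokenize(sentence.lower()):
        if word in word_frequencies:
            sentence_scores[sentence]=sentence_scores.get(sentence,0)+word_frequencies[word]

# 5. Generate the summary from the top-n sentences
summary_sentences=heapq.nlargest(2,sentence_scores,key=sentence_scores.get)
print(" ".join(summary_sentences))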
Sentiment Analysis using VADER : VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool available in NLTK. Before using it, download its lexicon :
import nltk
nltk.download('vader_lexicon')
Examples : Let us consider some text statements expressing different emotions and analyze them using VADER.
Example 1 :
from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader_analyzer=SentimentIntensityAnalyzer()
text1="I am feeling good" # The text is positive.
print(vader_analyzer.polarity_scores(text1))
Output :
{'neg': 0.0, 'neu': 0.185, 'pos': 0.815, 'compound': 0.5267}
It has given a ‘pos’ value of 0.815, which is the highest of all the values, since the statement is positive. Similarly, we can check it on other emotions as well.
Example 2 :
Output :
Example 3 : Consider the following example to get the overall rating of a statement,
i.e. whether overall it is positive, negative or neutral.
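A sketch of such an overall-rating check using the compound score (the ±0.05 cut-offs follow the commonly used VADER convention; the sample text is illustrative):

text3="The product is good but the delivery was terrible"
scores=vader_analyzer.polarity_scores(text3)
print(scores)
if scores['compound']>=0.05:
    print("Overall Sentiment : Positive")
elif scores['compound']<=-0.05:
    print("Overall Sentiment : Negative")
else:
    print("Overall Sentiment : Neutral")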
Output :
Word cloud is basically a data visualization technique for representing textual content in which the size of each visualized word indicates its importance, frequency or intensity. It is a good tool for visualizing text and performing sentiment analysis to find the frequency of words having positive, negative or neutral emotions.
Now, to perform sentiment analysis on the above dataset and create a wordcloud, consider the following code :
(Here, we will represent positive words with green colour, negative words with red colour and neutral words with white colour)
# Sentiment Analysis
# words, stop_words_data, positive_words, negative_words, positive and negative are defined earlier
sentiment_analyzer=SentimentIntensityAnalyzer()
for i in words:
    if not i.lower() in stop_words_data: # It will remove stopwords.
        polarity=sentiment_analyzer.polarity_scores(i)
        if polarity['compound']>=0.05: # Positive Sentiment
            positive_words[i]=polarity['compound']
        if polarity['compound']<=-0.05: # Negative Sentiment
            negative_words[i]=polarity['compound']
# Append the positive and negative words from the dictionaries to the lists positive[] and negative[]
for key,value in positive_words.items():
    positive.append(key)
for key,value in negative_words.items():
    negative.append(key)
# Create a dictionary mapping the colours : green for positive and red for negative
coloured_words={"green":positive,"red":negative}
# Colour lookup (intended as a method of a colour-function class used by the wordcloud)
def get_colour(self,word):
    try:
        colour=next(
            colour for (colour,words) in self.coloured_words.items()
            if word in words)
    except StopIteration:
        colour=self.default
    return colour
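One way the colour grouping above could be used with the wordcloud package (a sketch; the colour function and WordCloud parameters are illustrative):

# Build a word cloud whose colours come from the coloured_words mapping above
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def sentiment_colour_func(word, **kwargs):
    # green for positive words, red for negative words, white otherwise
    for colour, word_list in coloured_words.items():
        if word in word_list:
            return colour
    return "white"

wordcloud=WordCloud(background_color="black").generate(" ".join(positive+negative))
wordcloud.recolor(color_func=sentiment_colour_func)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()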
Output :
When it comes to social media such as Twitter, Facebook, YouTube, etc., a bulk of data is available which needs to be examined and analyzed to interpret the opinions of people conveyed in different formats. Sentiment analysis essentially expresses this subjective information in the form of emotions.
To get tweets through the Twitter API, a Twitter account is needed and an App has to be registered. Follow the steps below :
First create a Twitter account if you do not have one. Visit https://twitter.com/i/flow/signup and create an account. An existing account can also be used.
Now create an App on Twitter Developer using following link :
https://developer.twitter.com/en/apps
Now click on the “Create an App” button to create an application and get the API key credentials. It will ask you to apply for a Developer Account.
Click on Apply and continue, and then answer the questions visible on the screen.
After submitting the request, you will receive an email confirmation from Twitter. Then we can get the keys. Visit the following link for App creation :
https://developer.twitter.com/en/portal/register/welcome
Or you can also use the “Bearer Token” to perform authentication. For this code, the Bearer Token is used. If you want to use another approach, refer to https://docs.tweepy.org/en/stable/authentication.html. You can find the Bearer Token here :
Before using the Bearer Token, make sure the App has “Elevated” access. When an app is first created, it comes with “Essential” access; to use the Bearer Token directly, “Elevated” access is required.
If the token expires, it can be regenerated as well. Add the following lines of code :
auth=tweepy.OAuth2BearerHandler("Your Bearer Token")
api=tweepy.API(auth)
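The step that gathers the tweets could look roughly like this (a sketch using tweepy's search API; it assumes the api object from the authentication step above):

# Fetch tweets matching a keyword or hashtag
keyword=input("Enter the Hash Tag or Keyword for which you want to get the tweets : ")
fetched_tweets=api.search_tweets(q=keyword,count=10)
tweet_list=[tweet.text for tweet in fetched_tweets]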
print(tweet_list)
Output : (Example)
Enter the Hash Tag or Keyword for which you want to get the tweets : #sadhguru
More Analysis on Twitter Data : We can further perform different analyses on the gathered data as follows :
First select the user ID on which analysis is to be done.
Then we can find various information related to tweets such as 'created_at', 'id',
'id_str', 'text', 'truncated', 'entities', 'metadata', 'source', 'in_reply_to_status_id',
'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str',
'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors',
'retweeted_status', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited',
'retweeted', 'lang', 'possibly_sensitive'.
Example :
# Select a specific user by using a twitter user ID.
user_id=input("Enter a Twitter user ID : ")
no_of_tweets=int(input("How many tweets you want ? "))
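A sketch of fetching and printing the selected user's recent tweets (it assumes the api object from the authentication step above):

# Fetch the recent tweets of the selected user
tweets=api.user_timeline(screen_name=user_id,count=no_of_tweets)
for tweet in tweets:
    print("Tweet ID : ",tweet.id)
    print("Created at : ",tweet.created_at)
    print("Tweet : ",tweet.text)
    print("Retweet count : ",tweet.retweet_count,"\n")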
Output : (Example)
Tweet ID : 1502113329103487012
Created at : 2022-03-11 02:45:00+00:00
Tweet : Kriya Yoga requires nothing but dedication towards the practice. As you
refine your energies, there is no way you can remain untransformed.
#SadhguruQuotes https://t.co/byjrSIld2u
Retweet count : 1696
Tweet ID : 1501771371562471426
Created at : 2022-03-10 04:06:11+00:00
Tweet : Congratulations @CISFHQrs for your courageous & committed
contribution to Nation Building for more than five decades. Bharat is proud
& grateful for your stellar service. May you continue to inspire Peace &
Prosperity. Best Wishes. –Sg #CISFRaisingDay2022
Retweet count : 1595
Tweet ID : 1501750941187551232
Created at : 2022-03-10 02:45:00+00:00
Tweet : You cannot change the past. You can only experience the present moment.
The future must be crafted the way you want. #SadhguruQuotes
https://t.co/eTCAmU3gOl
Retweet count : 2510
Tweet ID : 1501624364889825281
Created at : 2022-03-09 18:22:02+00:00
Tweet : Machel, #VelliangiriMountains are a Cascade of Grace. Their Power has
empowered millions & will continue to empower future populations. Wonderful
your #Sadhanapada culminated here; it was beautiful to have you & Renee.
Journey on- sing, dance, also transform lives. Blessings. –Sg
https://t.co/y2qV6EBM2k
Retweet count : 1483
Visualizing Twitter Data : We can visualize the twitter data in multiple ways on the
basis of attributes returned by Twitter API.
twitter_data={'id':tweet_id,'created_at':tweet_created_at,'full_text':tweet_full_text,'retweet_count':tweet_retweet_count,'favorite_count':tweet_favorite_count}
# DataFrame
twitter_dataframe=pd.DataFrame(twitter_data)
# Plotting Pie Graph for retweets on each tweet.
twitter_dataframe['retweet_count'].plot.pie()
plt.show()
Output :
Enter a Twitter user ID : Tesla
How many tweets you want ? 10
Now to plot the likes and re-tweets received on each tweet, add the following script :
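A possible sketch for that plot (it assumes twitter_dataframe and matplotlib.pyplot as plt from the snippets above):

# Bar plot of likes and re-tweets per tweet
twitter_dataframe.plot(x='created_at',y=['favorite_count','retweet_count'],kind='bar',figsize=(16,4))
plt.show()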
Consider the following script to plot the Time Series for likes and re-tweets along with
dates on which the tweets were published.
# Time Series
time_likes=pd.Series(data=twitter_dataframe['favorite_count'].values,index=twitter_dataframe['created_at'])
time_likes.plot(figsize=(16,4),label="likes",legend=True,color="magenta")
time_retweets=pd.Series(data=twitter_dataframe['retweet_count'].values,index=twitter_dataframe['created_at'])
time_retweets.plot(figsize=(16,4),label="retweets",legend=True,color="blue")
plt.show()
You can also get information about the tweets, such as the total number of likes and re-tweets on each tweet, which tweet has the maximum number of likes and which got the maximum re-tweets.
Output : (Example)
Tweet ID : 1502475716935442442
Created at : 2022-03-12 02:45:00+00:00
Tweet : Only if you invest your emotions in what matters to you, will life become
powerful and really meaningful. #SadhguruQuotes https://t.co/EWJ2Aneqps
Retweet count : 447
Favorite count : 1247
Tweet ID : 1502375448885432326
Created at : 2022-03-11 20:06:34+00:00
Tweet : #SaveSoil #MoU #CARICOM
@GastonBrowne @AntiguaOpm @SkerritR @PhilipJPierreLC @pmharriskn @antiguagov
@SaintLuciaGov @skngov @molwynjoseph @SamMarshallMP @machelmontano @armandarton
@GlobalCitizenFo @cpsavesoil @PMOIndia https://t.co/RMXpcgW12d
Retweet count : 461
Favorite count : 1026
Tweet ID : 1502375423451164672
Created at : 2022-03-11 20:06:28+00:00
Tweet : A historic moment marked by the first #SaveSoil MoUs signed by the pearls of
the ocean. Governments of Antigua & Barbuda, Dominica, St Lucia, and St Kitts &
Nevis — may your commitment to soil revitalization be an inspiration to the rest of the
world. -Sg @CARICOMorg #CARICOM https://t.co/0glWuMlFBy
Retweet count : 1074
Favorite count : 2806
Tweet ID : 1502151438419464196
Created at : 2022-03-11 05:16:26+00:00
Tweet : Sir Vivian Richards & Lord Ian Botham - a joy to meet you during my Antigua
visit for the #SaveSoil movement. Your achievements in cricket & beyond are
commendable. Please join me in restoring our world’s Soil, the basis of all Life on
Earth. -Sg @ivivianrichards @BeefyBotham https://t.co/M53Ckhu0Lg
Retweet count : 2207
Favorite count : 5964
Tweet ID : 1502113329103487012
Created at : 2022-03-11 02:45:00+00:00
Tweet : Kriya Yoga requires nothing but dedication towards the practice. As you refine
your energies, there is no way you can remain untransformed. #SadhguruQuotes
https://t.co/byjrSIld2u
Retweet count : 1991
Favorite count : 5820
Sentiment Analysis on Twitter Data : We can also perform sentiment analysis on gathered twitter data.
Here two libraries will be needed, i.e. TextBlob and Vader.
1. textblob : It is a Python library which is used for processing textual data. It is built on top of the NLTK module and offers a simple API for performing basic Natural Language Processing tasks. To install textblob, use the following command :
pip install textblob
2. VADER description has already been given in previous topic for Sentiment Analysis using NLTK.
positive_tweets=0
negative_tweets=0
neutral_tweets=0
polarity_of_tweets=0
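The surrounding set-up for this loop can be sketched as follows (the helper sentiment_percentage and the loop variables are inferred from the lines below; fetched_tweets and no_of_tweets are assumed to come from the keyword search shown earlier):

from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def sentiment_percentage(part,whole):
    # assumed helper: express a count as a percentage of all fetched tweets
    return 100*float(part)/float(whole)

positive_tweets_list=[]
negative_tweets_list=[]
neutral_tweets_list=[]
vader_analyzer=SentimentIntensityAnalyzer()

for tweet in fetched_tweets:
    analysis=TextBlob(tweet.text)                                # TextBlob polarity
    polarity_score=vader_analyzer.polarity_scores(tweet.text)    # VADER scores
    positive_score=polarity_score['pos']
    negative_score=polarity_score['neg']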
    compound_score=polarity_score['compound']
    polarity_of_tweets+=analysis.sentiment.polarity
    if negative_score>positive_score:
        negative_tweets_list.append(tweet.text)
        negative_tweets+=1
    elif positive_score>negative_score:
        positive_tweets_list.append(tweet.text)
        positive_tweets+=1
    elif positive_score==negative_score:
        neutral_tweets_list.append(tweet.text)
        neutral_tweets+=1
positive_tweets=sentiment_percentage(positive_tweets,no_of_tweets)
negative_tweets=sentiment_percentage(negative_tweets,no_of_tweets)
neutral_tweets=sentiment_percentage(neutral_tweets,no_of_tweets)
polarity_of_tweets=sentiment_percentage(polarity_of_tweets,no_of_tweets)
positive_tweets=format(positive_tweets,'.1f')
negative_tweets=format(negative_tweets,'.1f')
neutral_tweets=format(neutral_tweets,'.1f')
Output :
Enter the Hash Tag or Keyword for which you want to get the tweets : amazonIN
Positive Tweets :
Negative Tweets :
Neutral Tweets :
Downloading Twitter datasets online : Online datasets containing Twitter data can be downloaded and different analytics can be performed on them.
Example : https://www.kaggle.com/crowdflower/twitter-user-gender-classification
Self-Activity : Download the data from the above link and apply different analytics techniques on it.
Now create an app to get the token to be used for further processing.
Click on Create App and then select the type of app to be created. Multiple options are available, i.e. Business, Consumer, Instant Games, Gaming, Workplace, None. You can read the details and select an option. If the Business option is selected, it creates an app which manages business assets like Pages, Events, Groups, Ads, Messenger and the Instagram Graph API using the available business permissions, features and products.
After an app gets created, you can get the Access Token as follows :
Go to : https://developers.facebook.com/tools/explorer/
Now click on Generate Access Token. Proceed to the next step by clicking on
Continue.
The Access Token will be visible in Access Token input box.
Now allow the necessary permissions to access the Facebook pages as well.
Click on “Add a Permission” dropdown and select the permissions from it.
From “User or Pages” dropdown select “Get User Token” again to get the Token with
revised permissions.
Now, if we want to see the details of publicly available Facebook users or pages, change the request in the request URL box.
Example : If we want to get the details of the Facebook Page “Sanganak Academy”, then change the request to : SanganakAcademy?fields=id,name. Before that, make sure to allow the permissions for accessing that page as well.
You can also see the posts on this page by changing the URL to : SanganakAcademy?fields=id,name,posts. The same can be done to access other information as well.
Provide Access Token and get the URL to access the data :
# Access Token
access_token="Your Access Token"
# In Graph URL, provide the correct version of Graph API. Here currently I am using v13.0
graphURL="https://graph.facebook.com/v13.0/"
# Request URL to get the relevant data of Facebook Page Sanganak Academy.
# You can access any other page by using its ID as well.
requestURL="SanganakAcademy?fields=id,name,posts{message,created_time,comments.limit(0).summary(true),likes.limit(0).summary(true)}"
actual_url=graphURL+requestURL
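The request itself and the flattening of the posts into a DataFrame could be sketched as follows (the column names match those used in the plotting snippet below; the exact response structure depends on the requested fields):

# Request the Graph API data and flatten the posts into a DataFrame
import requests
import pandas as pd

response=requests.get(actual_url,params={"access_token":access_token})
fb_json=response.json()
posts=fb_json["posts"]["data"]                # posts with message, created_time, comments, likes
fbdata_with_dates=pd.json_normalize(posts)    # yields columns such as 'likes.summary.total_count'
print(fbdata_with_dates.head())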
time_likes=pd.Series(data=fbdata_with_dates['likes.summary.total_count'].values,index=fbdata_with_dates['created_time'])
time_likes.plot(figsize=(16,4),label="likes",legend=True)
plt.show()
And then click on “Generate Access Token”. Now copy the generated access token
and use it for further analysis.
Creating a Facebook post : To post something on Facebook Page wall, use put_object()
method of Facebook Graph API.
import facebook
access_token="Your Page Access Token"
fb=facebook.GraphAPI(access_token)
fb.put_object(parent_object='me', connection_name='feed', message='Hello all...Welcome to Sanganak Academy')
Here, 511268930590718 is the Facebook Post ID. To get “Facebook Page ID”, Open the
Facebook Page and in About section, you will find the Facebook Page ID.
Now combine Facebook Page ID and Facebook Post ID as pageid_postid. For example : If
Page ID = 12345 and Post ID = 511268930590718, then combine it as
12345_511268930590718.
Liking a post :
Syntax : graph_api_object.put_like(object_id = ‘post_id’)
Example :
# Liking a facebook post.
fb.put_like(object_id='12345_511268930590718')
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (12, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
fig = plt.figure()
plt.show()
import datetime

# Convert the publish_time strings into datetime objects
for i in range(youtube_data.shape[0]):
    date_time_obj = datetime.datetime.strptime(youtube_data['publish_time'].at[i], '%Y-%m-%dT%H:%M:%S.000Z')
    youtube_data['publish_time'].at[i] = date_time_obj

# Split the publish date into separate date, year, month and day columns
date=[]
year=[]
month=[]
day=[]
for i in range(youtube_data.shape[0]):
    d = youtube_data['publish_time'][i].date()
    y = youtube_data['publish_time'][i].date().year
    m = youtube_data['publish_time'][i].date().month
    days = youtube_data['publish_time'][i].date().day
    date.append(d) # Storing dates
    year.append(y) # Storing years
    month.append(m) # Storing months
    day.append(days) # Storing days
youtube_data.drop(['publish_time'], inplace=True,axis=1)
youtube_data['publish_time']=date
youtube_data['year']=year
youtube_data['month'] = month
youtube_data['day'] = day
Lab Assignments
SET A
1. Consider any text paragraph. Preprocess the text to remove any special characters and digits. Generate
the summary using extractive summarization process.
2. Consider any text paragraph. Remove the stopwords. Tokenize the paragraph to extract words and
sentences. Calculate the word frequency distribution and plot the frequencies. Plot the wordcloud of
the text.
3. Consider the following review messages. Perform sentiment analysis on the messages.
i. I purchased headphones online. I am very happy with the product.
ii. I saw the movie yesterday. The animation was really good but the script was ok.
Set B
1. Consider the following dataset :
https://www.kaggle.com/datasets/prasertk/top-1000-instagram-influencers
Write a Python script for the following :
i. Read the dataset and find the top 5 Instagram influencers from India.
ii. Find the Instagram account having least number of followers.
iii. Read the column “Category”, remove stopwords and plot the wordcloud to find the keywords indicating the categories in which the maximum number of accounts are created.
iv. Group the Instagram accounts category wise.
v. Visualize the dataset and plot the relationship between Followers and Authentic engagement columns.
Set C
Q.2 Write a Python script to read the Tweets using Twitter API and tweepy library to perform the
following tasks :
i. Authenticate Twitter API (Using Bearer Token)
ii. Get the tweets using Keywords or Hash Tags.
iii. Find the total number of likes and retweets on each tweet.
iv. Find the most liked tweet and print its text.
v. Visualize the tweets and plot the time series for likes and retweets along with dates on which tweets
are published.
1. Import and format the data into a DataFrame using the pandas library. Example : For working with Facebook Posts, read the JSON file available in the “Posts” folder as :
import pandas as pd
facebook_dataframe=pd.read_json("your_posts.json")
Similarly you can work with other downloaded data and read the JSON files available in them.
2. Now perform data cleaning operation on created dataframe and remove unnecessary columns.
3. Perform multiple statistical analysis such as finding the posts by date, number of likes on a post, comments
on a post.
4. Perform sentiment analysis to find the polarity scores and classify the posts text in three categories i.e.
positive, negative and neutral posts.
Assignment Evaluation