Tweet-Sentiment-Extraction - Exploratory Data Analysis
Problem
With so many tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will help a company's or a
person's brand by going viral (positive) or devastate profit because it strikes a negative tone. Capturing sentiment in language is
important at a time when decisions and reactions are created and updated in seconds. But which words actually lead to the
sentiment description? In this problem we need to pick out the part of the tweet (a word or phrase) that reflects the sentiment.
Source of Data
The data comes from a Kaggle competition. In this competition, support phrases have been extracted from Figure Eight's Data for Everyone
platform. The data is available at https://www.kaggle.com/c/tweet-sentiment-extraction/data
Data Overview
It consists of two data files: train.csv with 27481 rows and test.csv with 3534 rows.
selected_text: the phrase or words from the text that best support the sentiment.
Sentiment: negative
Output: So sad
Performance Metrics
The metric in this problem is the word-level Jaccard score. The Jaccard score, or Jaccard similarity, is defined as the size of the
intersection divided by the size of the union of two sets; here the two sets are the words of the predicted and the true selected text.
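To make the metric concrete, here is a minimal sketch of a word-level Jaccard function. The function name `jaccard`, the lowercasing, the whitespace tokenization, and the handling of two empty strings are assumptions made for illustration; the competition's own evaluation code may differ in these details.

```python
def jaccard(str1: str, str2: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    # Assumption: words are compared case-insensitively and split on whitespace.
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        # Assumption: two empty strings are treated as a perfect match.
        return 1.0
    c = a & b
    # |A ∩ B| / |A ∪ B|, with |A ∪ B| = |A| + |B| - |A ∩ B|
    return len(c) / (len(a) + len(b) - len(c))


# 2 shared words out of 10 distinct words overall
print(jaccard("Sooo SAD", "Sooo SAD I will miss you here in San Diego!!!"))  # 0.2
```

Because the score works on sets of words, repeated words and word order do not affect it.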
train.head()
0 cb774db0d1 I`d have responded, if I were going I`d have responded, if I were going neutral
1 549e992a42 Sooo SAD I will miss you here in San Diego!!! Sooo SAD negative
4 358bd9e861 Sons of ****, why couldn`t they put them on t... Sons of ****, negative
test.head()
textID text sentiment
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 textID 27481 non-null object
1 text 27480 non-null object
2 selected_text 27480 non-null object
3 sentiment 27481 non-null object
dtypes: object(4)
memory usage: 858.9+ KB
train.dropna(inplace=True)
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3534 entries, 0 to 3533
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 textID 3534 non-null object
1 text 3534 non-null object
2 sentiment 3534 non-null object
dtypes: object(3)
memory usage: 83.0+ KB
import matplotlib.pyplot as plt

def patch1(bar,ax):
    # Write the height (here, the tweet count) above each bar.
    #https://stackoverflow.com/questions/52080991/display-percentage-above-bar-chart-in-matplotlib
    for p in bar.patches:
        width=p.get_width()
        height=p.get_height()
        x,y=p.get_xy()
        ax.annotate('{}'.format(height),(x+width/2,y+height*1.02),ha='center',fontsize=14)
fig,ax=plt.subplots(figsize=(12,8))
a=train['sentiment'].value_counts().plot(kind='bar')
patch1(a,ax)
ax.set_xlabel("Sentiments",fontsize=18,labelpad=15)
ax.set_ylabel("Count",fontsize=18,labelpad=15)
plt.xticks(rotation='horizontal',fontsize=14)
plt.yticks(fontsize=14)
ax.set_facecolor("tan")
fig.set_facecolor("tan")
plt.show()
No. of neutral tweets : 11117
train['sentiment'].value_counts(normalize=True)
neutral 0.404549
positive 0.312300
negative 0.283151
Name: sentiment, dtype: float64
About 40 percent of the tweets are neutral, followed by positive (about 31 percent) and negative (about 28 percent) tweets.
#https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/
#https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python
import re

def remove_hyperlinks(text):
    # Drop http(s):// and www. links (raw string avoids invalid-escape warnings).
    hyperlinkfree=re.sub(r'https?://\S+|www\.\S+', '', text)
    return hyperlinkfree
#storing the hyperlink-free text
train['text']=train['text'].apply(lambda x:remove_hyperlinks(x))
train['selected_text']=train['selected_text'].apply(lambda x:remove_hyperlinks(x))
#https://stackoverflow.com/questions/12851791/removing-numbers-from-string
#https://stackoverflow.com/questions/18082130/python-regex-to-remove-all-words-which-contains-number
def remove_numbers(text):
    # Drop every whitespace-delimited token that contains a digit.
    numbersfree=re.sub(r"\S*\d\S*","",text).strip()
    return numbersfree
#storing the number-free text
train['text']=train['text'].apply(lambda x:remove_numbers(x))
train['selected_text']=train['selected_text'].apply(lambda x:remove_numbers(x))
#https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/
import string

def remove_punctuation(text):
    # Keep only characters that are not in string.punctuation.
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree
#storing the punctuation-free text
train['text']=train['text'].apply(lambda x:remove_punctuation(x))
train['selected_text']=train['selected_text'].apply(lambda x:remove_punctuation(x))
#https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/
from nltk.tokenize import RegexpTokenizer

def token(text):
    # Tokenize into runs of word characters/apostrophes and rejoin with single spaces.
    tokenizer =RegexpTokenizer(r"[\w']+")
    return " ".join(tokenizer.tokenize(text))
train['text']=train['text'].apply(lambda x: token(x))
train['selected_text']=train['selected_text'].apply(lambda x: token(x))
train['text_length']=train['text'].astype(str).apply(len)
train['text_word_count']= train['text'].apply(lambda x: len(str(x).split()))
train['selected_text_word_count']= train['selected_text'].apply(lambda x: len(str(x).split()))
train.head()
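To see what the four cleaning steps above do to a single tweet, the sketch below chains them in one function and applies it to a made-up example. The name `clean_tweet` and the sample string are hypothetical, and `re.findall` is used here as a dependency-free stand-in for NLTK's `RegexpTokenizer` with the same pattern.

```python
import re
import string


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps in the same order as the notebook cells."""
    # 1. Remove hyperlinks.
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # 2. Remove every whitespace-delimited token that contains a digit.
    text = re.sub(r"\S*\d\S*", "", text).strip()
    # 3. Remove punctuation characters.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # 4. Tokenize and rejoin with single spaces
    #    (stand-in for RegexpTokenizer(r"[\w']+").tokenize).
    return " ".join(re.findall(r"[\w']+", text))


# Hypothetical sample tweet, not taken from the dataset.
sample = "Check http://t.co/abc123 I`d have 2 cats!!!"
print(clean_tweet(sample))  # Check Id have cats
```

Note that the backtick and the apostrophe are both in `string.punctuation`, so contractions such as "I`d" collapse to "Id" before tokenization, and the apostrophe in the tokenizer pattern no longer has anything to match.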
Let's create three separate dataframes for positive, neutral and negative sentiments. This will help in analyzing the text statistics
separately for separate polarities.
pos=train[train['sentiment']=='positive']
neg=train[train['sentiment']=='negative']
neutral=train[train['sentiment']=='neutral']
#https://www.geeksforgeeks.org/plotting-histogram-in-python-using-matplotlib/
fig,ax=plt.subplots(1,3,figsize=(20,6))
ax[0].hist(pos['text_length'],bins=20,color='green')
ax[0].set_facecolor("tan")
ax[0].set_ylabel('positive_count',fontsize=14,labelpad=10)
ax[0].set_xlabel('text_length',fontsize=14,labelpad=10)
ax[1].hist(neg['text_length'],bins=20,color='red')
ax[1].set_facecolor("tan")
ax[1].set_ylabel('negative_count',fontsize=14,labelpad=10)
ax[1].set_xlabel('text_length',fontsize=14,labelpad=10)
ax[2].hist(neutral['text_length'],bins=20,color='yellow')
ax[2].set_facecolor("tan")
ax[2].set_ylabel('neutral_count',fontsize=14,labelpad=10)
ax[2].set_xlabel('text_length',fontsize=14,labelpad=10)
fig.set_facecolor("tan")
plt.show()
The histograms show that the cleaned text is at most approximately 140 characters long.
Tweets of all three sentiments have more or less the same length distribution. Hence the length of the text is not a powerful indicator of polarity.
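The visual impression that the sentiments have similar lengths can also be checked numerically with a per-sentiment summary. The sketch below uses a tiny made-up DataFrame (`toy`) rather than the competition data, so the numbers it prints say nothing about the real tweets; on the notebook's `train` frame the same `groupby` call could be applied to the `text_length` and `text_word_count` columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned training frame.
toy = pd.DataFrame({
    "sentiment": ["positive", "positive", "negative", "negative", "neutral", "neutral"],
    "text": ["love this song", "great day", "so sad today",
             "i hate rain", "going home now", "at work"],
})
toy["text_length"] = toy["text"].str.len()
toy["text_word_count"] = toy["text"].str.split().str.len()

# Mean, median and maximum of both length measures for each sentiment.
summary = (
    toy.groupby("sentiment")[["text_length", "text_word_count"]]
       .agg(["mean", "median", "max"])
)
print(summary)
```

If the rows of such a table are close to each other, that supports the conclusion drawn from the histograms.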
#https://www.geeksforgeeks.org/kde-plot-visualization-with-pandas-and-seaborn/
import seaborn as sns

fig,ax=plt.subplots(figsize=(8,6))
# The keyword is lowercase `label`; `pad` takes a number, not a string.
sns.kdeplot(train[train['sentiment']=='positive']['text_length'],color='r',label='positive',ax=ax)
sns.kdeplot(train[train['sentiment']=='negative']['text_length'],color='b',label='negative',ax=ax)
sns.kdeplot(train[train['sentiment']=='neutral']['text_length'],color='g',label='neutral',ax=ax)
ax.set_title('Tweets Distribution',loc='center',fontsize=25,pad=20,fontweight='bold')
ax.set_ylabel('Probability Density',fontsize=18,labelpad=15)
ax.set_xlabel('text_length',fontsize=18,labelpad=15)
plt.xticks(rotation='horizontal',fontsize=14)
plt.yticks(fontsize=14)
plt.legend(fontsize=13)
ax.set_facecolor("tan")
fig.set_facecolor("tan")
plt.show()
Most tweets have a text length in the range 25 to 60 characters for all categories.
#https://www.geeksforgeeks.org/plotting-histogram-in-python-using-matplotlib/
fig,ax=plt.subplots(1,3,figsize=(20,6))
ax[0].hist(pos['text_word_count'],bins=10,color='green')
ax[0].set_facecolor("tan")
ax[0].set_ylabel('positive_count',fontsize=14,labelpad=10)
ax[0].set_xlabel('text_word_count',fontsize=14,labelpad=10)
ax[1].hist(neg['text_word_count'],bins=10,color='red')
ax[1].set_facecolor("tan")
ax[1].set_ylabel('negative_count',fontsize=14,labelpad=10)
ax[1].set_xlabel('text_word_count',fontsize=14,labelpad=10)
ax[2].hist(neutral['text_word_count'],bins=10,color='yellow')
ax[2].set_facecolor("tan")
ax[2].set_ylabel('neutral_count',fontsize=14,labelpad=10)
ax[2].set_xlabel('text_word_count',fontsize=14,labelpad=10)
fig.set_facecolor("tan")
plt.show()
The histograms show that the word count of the text is at most approximately 35 words.
Positive tweets: most tweets have word counts in the range 5 to 15.
Negative tweets: most tweets have word counts in the range 5 to 15.
Neutral tweets: most tweets have word counts in the range 5 to 15.
Very few tweets have more than 30 words in any category.
Very few tweets have fewer than 5 words in any category.
fig,ax=plt.subplots(figsize=(10,8))
#https://www.geeksforgeeks.org/how-to-show-mean-on-boxplot-using-seaborn-in-python/
sns.boxplot(x='sentiment',y='text_word_count',data=train,showmeans=True,ax=ax,meanprops={"marker": "+","markeredgecol
ax.set_ylabel('text_word_count',fontsize=18,labelpad=15)
ax.set_xlabel('sentiments',fontsize=18,labelpad=15)
plt.xticks(rotation='horizontal',fontsize=14)
plt.yticks(fontsize=14)
ax.set_facecolor("tan")
fig.set_facecolor("tan")
plt.show()
Tweets of all three sentiments have more or less the same word count.
#https://www.geeksforgeeks.org/kde-plot-visualization-with-pandas-and-seaborn/
fig,ax=plt.subplots(figsize=(8,6))
sns.kdeplot(train[train['sentiment']=='positive']['text_word_count'],color='r',label='positive',ax=ax)
sns.kdeplot(train[train['sentiment']=='negative']['text_word_count'],color='b',label='negative',ax=ax)
sns.kdeplot(train[train['sentiment']=='neutral']['text_word_count'],color='g',label='neutral',ax=ax)
ax.set_title('Tweets Distribution',loc='center',fontsize=25,pad=20,fontweight='bold')
ax.set_ylabel('Probability Density',fontsize=18,labelpad=15)
ax.set_xlabel('text_word_count',fontsize=18,labelpad=15)
plt.xticks(rotation='horizontal',fontsize=14)
plt.yticks(fontsize=14)
plt.legend(fontsize=13)
ax.set_facecolor("tan")
fig.set_facecolor("tan")
plt.show()
Most tweets have word counts in the range 5 to 15 for all categories.
from wordcloud import WordCloud, STOPWORDS

fig,ax=plt.subplots(1,3,figsize=(15,10))
stopwords = set(STOPWORDS)
# Build one word cloud from all positive tweets joined into a single string.
wordcloud1 = WordCloud(width = 800, height = 800,
                       background_color ='white',
                       stopwords = stopwords,
                       min_font_size = 10).generate(" ".join(train[train['sentiment']=='positive']['text']))
# plot the WordCloud image
ax[0].imshow(wordcloud1)
ax[0].axis("off")
ax[0].set_title('Positive text',fontsize=20)
plt.show()
The wordclouds give an idea of the words which might influence the polarity of the tweet.
nltk.download('stopwords')
from sklearn.feature_extraction.text import CountVectorizer

def top_n_words(corpus,n):
    # Return the n most frequent non-stop-words in corpus as (word, count) pairs.
    vec=CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words=vec.transform(corpus)
    sum_words=bag_of_words.sum(axis=0)
    words_freq=[(word, sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
    words_freq=sorted(words_freq,key=lambda x: x[1], reverse=True)
    return words_freq[:n]
d=top_n_words(neg['text'],20)
x_val=[x[0] for x in d]
y_val=[x[1] for x in d]
ax[1].barh(x_val,y_val,color ='red')
ax[1].set_title('common words for negative text',fontsize=14)
ax[1].set_xlabel('count',fontsize=14,labelpad=10)
ax[1].set_facecolor("tan")
d=top_n_words(neutral['text'],20)
x_val=[x[0] for x in d]
y_val=[x[1] for x in d]
ax[2].barh(x_val,y_val,color ='yellow')
ax[2].set_title('common words for neutral text',fontsize=14)
ax[2].set_xlabel('count',fontsize=14,labelpad=10)
ax[2].set_facecolor("tan")
fig.set_facecolor("tan")
plt.show()
d=top_n_words(neg['selected_text'],20)
x_val=[x[0] for x in d]
y_val=[x[1] for x in d]
ax[1].barh(x_val,y_val,color ='red')
ax[1].set_title('common words for negative selected text',fontsize=14)
ax[1].set_xlabel('count',fontsize=14,labelpad=10)
ax[1].set_facecolor("tan")
d=top_n_words(neutral['selected_text'],20)
x_val=[x[0] for x in d]
y_val=[x[1] for x in d]
ax[2].barh(x_val,y_val,color ='yellow')
ax[2].set_title('common words for neutral selected text',fontsize=14)
ax[2].set_xlabel('count',fontsize=14,labelpad=10)
ax[2].set_facecolor("tan")
fig.set_facecolor("tan")
plt.show()