Tweet-Sentiment-Extraction - Data Preprocessing

The document discusses preprocessing text data for sentiment analysis. It imports necessary libraries, loads training and test datasets from a Google Drive folder, and cleans the text data by removing URLs, punctuation, numbers, and special characters. It also drops rows where the text or selected text columns are empty. A function is defined to find spelling errors by comparing words in the text and selected text columns.

Uploaded by

Dipanshu Rana

pip install fuzzywuzzy

Collecting fuzzywuzzy
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0

import pandas as pd
import pickle
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
# load in all the modules we're going to need
import string
import re
import collections
import nltk
import tensorflow as tf
from fuzzywuzzy import fuzz

from google.colab import drive


drive.mount('/content/drive')

Mounted at /content/drive

train=pd.read_csv('/content/drive//My Drive/Tweet Sentiment Extraction/train.csv')


test=pd.read_csv('/content/drive//My Drive/Tweet Sentiment Extraction/test.csv')

train.head()

textID text selected_text sentiment

0 cb774db0d1 I`d have responded, if I were going I`d have responded, if I were going neutral

1 549e992a42 Sooo SAD I will miss you here in San Diego!!! Sooo SAD negative

2 088c60f138 my boss is bullying me... bullying me negative

3 9642c003ef what interview! leave me alone leave me alone negative

4 358bd9e861 Sons of ****, why couldn`t they put them on t... Sons of ****, negative

test.head()

textID text sentiment

0 f87dea47db Last session of the day http://twitpic.com/67ezh neutral

1 96d74cb729 Shanghai is also really exciting (precisely -... positive

2 eee518ae67 Recession hit Veronique Branquinho, she has to... negative

3 01082688c6 happy bday! positive

4 33987a8ee5 http://twitpic.com/4w75p - I like it!! positive

print("train shape :",train.shape)


print("test shape :",test.shape)

train shape : (27481, 4)


test shape : (3534, 3)

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 textID 27481 non-null object
1 text 27480 non-null object
2 selected_text 27480 non-null object
3 sentiment 27481 non-null object
dtypes: object(4)
memory usage: 858.9+ KB

test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3534 entries, 0 to 3533
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 textID 3534 non-null object
1 text 3534 non-null object
2 sentiment 3534 non-null object
dtypes: object(3)
memory usage: 83.0+ KB

train.dropna(inplace=True)

train.shape

(27480, 4)

#https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/
train['text']= train['text'].apply(lambda x: x.lower())
test['text']= test['text'].apply(lambda x: x.lower())
train['selected_text']= train['selected_text'].apply(lambda x: x.lower())

#https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python
def remove_hyperlinks(text):
    # Strip http(s) and www links from the text
    hyperlinkfree=re.sub(r'https?://\S+|www\.\S+', '', text)
    return hyperlinkfree
#storing the hyperlink-free text
train['text']=train['text'].apply(lambda x:remove_hyperlinks(x))
test['text']=test['text'].apply(lambda x:remove_hyperlinks(x))
train['selected_text']=train['selected_text'].apply(lambda x:remove_hyperlinks(x))

#https://stackoverflow.com/questions/12851791/removing-numbers-from-string
#https://stackoverflow.com/questions/18082130/python-regex-to-remove-all-words-which-contains-number
def remove(text):
    text=re.sub(r'\S*\d\S*',' ',text) #Removing any token that contains a digit
    text=re.sub(r'<.*?>+',' ',text) #Removing text inside angle brackets
    text=re.sub(r'\[.*?\]',' ',text) #Removing text inside square brackets
    text=re.sub(r'\n',' ',text) #Replacing newline characters with spaces
    text=re.sub(r'\*+','<ABUSE>',text) #Replacing runs of asterisks (e.g. ****) with <ABUSE>

    return text
#storing the cleaned text
train['text']=train['text'].apply(lambda x:remove(x))
test['text']=test['text'].apply(lambda x:remove(x))
train['selected_text']=train['selected_text'].apply(lambda x:remove(x))

#https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree
#storing the punctuation-free text
train['text']=train['text'].apply(lambda x:remove_punctuation(x))
test['text']=test['text'].apply(lambda x:remove_punctuation(x))
train['selected_text']=train['selected_text'].apply(lambda x:remove_punctuation(x))
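The cleaning cells above can be condensed into a single helper for quick checks on individual strings. This is only a sketch: `clean_text` is a hypothetical name that does not appear in the notebook, and it assumes the same step order as the cells (lowercasing, URL removal, number and bracket removal, punctuation removal).

```python
import re
import string

def clean_text(text):
    """Apply the notebook's cleaning steps, in order, to one string."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'\S*\d\S*', ' ', text)              # drop tokens containing digits
    text = re.sub(r'<.*?>+', ' ', text)                # drop <...> spans
    text = re.sub(r'\[.*?\]', ' ', text)               # drop [...] spans
    text = re.sub(r'\n', ' ', text)                    # newlines become spaces
    text = re.sub(r'\*+', '<ABUSE>', text)             # censored words become <ABUSE>
    # Punctuation removal also strips the angle brackets, leaving "ABUSE".
    return "".join(ch for ch in text if ch not in string.punctuation)

print(clean_text("Sooo SAD I will miss you here in San Diego!!!"))
print(clean_text("Sons of ****, why couldn`t they"))
```

Because the asterisk replacement runs after lowercasing, the `ABUSE` marker stays uppercase and is easy to tell apart from ordinary words.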

train.head()

textID text selected_text sentiment

0 cb774db0d1 id have responded if i were going id have responded if i were going neutral

1 549e992a42 sooo sad i will miss you here in san diego sooo sad negative

2 088c60f138 my boss is bullying me bullying me negative

3 9642c003ef what interview leave me alone leave me alone negative

4 358bd9e861 sons of ABUSE why couldnt they put them on th... sons of ABUSE negative

test.head()

textID text sentiment

0 f87dea47db last session of the day neutral

1 96d74cb729 shanghai is also really exciting precisely s... positive

2 eee518ae67 recession hit veronique branquinho she has to ... negative

3 01082688c6 happy bday positive

4 33987a8ee5 i like it positive

train[train["text"]==' ']
textID text selected_text sentiment

1319 bc84f21e3b shoesshoesshoesyayyayyayloli positive

24926 0872ed0f00 neutral

train[train["selected_text"]==' ']

textID text selected_text sentiment

9533 c6149b7abf its castiel positive

10997 6d85945e2c hussein i have to wake up earlier than i thou... negative

13728 c7b78c1b26 btw and i ordered some of yer merch yester... positive

24348 a21d9c38a8 ill keep working on her lol good idea positive

25455 b6bf74a5c8 europe sounds will finish my exam on teus a... positive

25637 370880f242 its realy my book is on the side im not stu... negative

test[test["text"]==' ']

textID text sentiment

train.drop(train[train["text"]==' '].index,inplace=True)

train.drop(train[train["selected_text"]==' '].index,inplace=True)

train.head()

textID text selected_text sentiment

0 cb774db0d1 id have responded if i were going id have responded if i were going neutral

1 549e992a42 sooo sad i will miss you here in san diego sooo sad negative

2 088c60f138 my boss is bullying me bullying me negative

3 9642c003ef what interview leave me alone leave me alone negative

4 358bd9e861 sons of ABUSE why couldnt they put them on th... sons of ABUSE negative

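def wrong_words(text,selected):
    # Collect the words of selected_text that do not occur as whole words in text
    words=[]
    text=text.split()
    selected=selected.split()
    for i in selected:
        if i not in text:
            words.append(i)
    if len(words)>0:
        return " ".join(words)
    else:
        return '++++' # sentinel meaning "no mismatched words"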

train['spelling']=train.apply(lambda x: wrong_words(x.text,x.selected_text),axis=1)
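To see what ends up in the `spelling` column, the same check can be run on two plain strings. The sketch below restates `wrong_words` compactly; the inputs are taken from the visible parts of rows printed in this notebook, and the list-comprehension form is a stylistic choice rather than the notebook's own code.

```python
def wrong_words(text, selected):
    """Return the words of `selected` that are not whole words of `text`."""
    text_words = text.split()
    missing = [w for w in selected.split() if w not in text_words]
    # '++++' is the notebook's sentinel for "no mismatch found".
    return " ".join(missing) if missing else '++++'

print(wrong_words("my boss is bullying me", "bullying me"))
print(wrong_words("hes awesome have you worked with him before", "s awesome"))
```

The first call finds every selected word in the tweet, while the second reports the stray `s` left over from a selection that started in the middle of `hes`.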

train[train['spelling'].apply(lambda x: len(x))==1]

textID text selected_text sentiment spelling

49 3fcea4debc which case i got a new one last week and im n... d im not thrilled at all with mine negative d

66 95e12b1cb1 hes awesome have you worked with him before ... s awesome positive s

129 94f67cfa6d hey mia totally adore your music when will ... y adore positive y

134 6903cb08f2 nice to see you tweeting its sunday may an... e nice positive e

166 c78bf59e67 lichfield tweetup sounds like fun hope to s... p sounds like fun positive p

... ... ... ... ... ...

27153 a044ed928d enjoy nola definitely one of my favorite citi... y one of my favorite cities in the world positive y

27240 40143b692e who knows it makes me sad lol e sad negative e

27426 132e051fe8 my cousins moved there like years ago and ... m sad negative m

27470 778184dff1 lol i know and hahadid you fall asleep or ju... t bored negative t

27476 4eac33d1c0 wish we could come see u on denver husband l... d lost negative d

680 rows × 5 columns


def remove_text(x):
    # Drop the stray token stored in 'spelling' from the selected text
    selected=x['selected_text']
    spelling=x['spelling']
    selected=selected.split()
    selected.remove(spelling) #https://www.geeksforgeeks.org/python-list-remove/
    return " ".join(selected)

train['selected_text']=train[['selected_text','spelling']].apply(lambda x: remove_text(x) if len(x['spelling'])==
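The line above is cut off in the source, so its condition and fallback are not visible. One plausible reading, given that the preceding filter selects rows whose `spelling` has length 1, is that the stray single character is removed only in that case and the selection is otherwise left alone. The sketch below expresses that assumed per-row rule as a plain function. `fix_single_char` is a hypothetical name, and both the `== 1` test and the unchanged fallback are assumptions, not the notebook's confirmed code.

```python
def fix_single_char(selected, spelling):
    """Drop a stray one-character token from `selected` (assumed rule)."""
    if len(spelling) != 1:
        return selected              # assumed fallback: leave the row unchanged
    words = selected.split()
    words.remove(spelling)           # list.remove deletes the first occurrence only
    return " ".join(words)

print(fix_single_char("s awesome", "s"))
print(fix_single_char("leave me alone", "++++"))
```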

train['spelling']=train.apply(lambda x: wrong_words(x.text,x.selected_text),axis=1)

train[train['spelling'].apply(lambda x: len(x))==1]

textID text selected_text sentiment spelling

train.loc[(train['spelling']!='++++') & (train['sentiment']=='neutral')]

textID text selected_text sentiment spelling

92 a3de81e1ba hi how are you doing ABUSEjust joined twit... hi how are you doing ABUSEjust joined twitter neutral twitter

251 77ba0fee75 powerblog what is this powerblog challenge y... g what is this powerblog challenge you keep ta... neutral g followe

366 b751f39570 yea i should know but tell me everything ps ... yea i should know but tell me everything ps s... neutral ha

581 1fce3e2d7b no bueno hollykins needs to feel better asap... no bueno hollykins needs to feel better asap ... neutral soproudofyo

637 b04bbb81c9 hi to one kiwi artist from another kiwi artist hi to one kiwi artist from another kiwi artis neutral artis

... ... ... ... ... ...

26870 ff77427519 will do it in a couple od days when i have ... will do it in a couple od days when i have ... neutral pe

26882 336b9cfc93 xdxdxd you crazy little thing why didn�t ... dxd you crazy little thing why didn�t you ge... neutral dxd

27067 c2dfc5875c jamie sean cody up for some angry ABUSE jamie... amie sean cody up for some angry ABUSE jamie ... neutral amie

27111 44cf670176 haha itll be gross by the time it comes back ... haha itll be gross by the time it comes back ... neutral india

27349 f60f20ed09 yea but thats an old pic she looks a lot dif... yea but thats an old pic she looks a lot diff... neutral lo

149 rows × 5 columns

train['selected_text']=train.apply(lambda x: x['text']
if ( (x['spelling']!='++++') & (x['sentiment']=='neutral') ) else x['selected_text'],

train['spelling']=train.apply(lambda x: wrong_words(x.text,x.selected_text),axis=1)

train.loc[(train['spelling']!='++++') & (train['sentiment']=='neutral')].head()

textID text selected_text sentiment spelling

train.loc[(train['spelling']!='++++') & (train['sentiment']=='positive')]

textID text selected_text sentiment spelling

39 2863f435bd a little happy for the wine jeje ok itsm my fr... a little happy fo positive fo

149 11c10e3fc5 shes unassuming and unpretentious shes just a... endearing positive endearing

168 e56eeb4968 few bevvies in twngreat on a day off great positive great

240 7df7221145 thats why i need to be thereto represent the... thats why i need to be there positive there

309 a54d3c2825 i know it was worth a shot though as wort positive as wort

... ... ... ... ... ...

27339 27829d441f just got out of the pool so funnow gonna wat... so fun positive fun

27376 13fbc75291 going to church in the morninghappy mommas day... happy positive happy

27386 e149ebd3a1 would one of the vwllers want to add this ev... ch appreciat positive ch appreciat

27401 261e064dd4 oh silence verona i am wanting to go jaja ... ja enjoyyitverymu positive ja enjoyyitverymu

27474 8f14bb2715 so i get up early and i feel good about the da... i feel good ab positive ab

547 rows × 5 columns

train.loc[(train['spelling']!='++++') & (train['sentiment']=='negative')]


textID text selected_text sentiment spelling

18 af3fed7fc3 is back home now gonna miss every one onna negative onna

27 bdc32ea43c on the way to malaysiano internet access to twit no internet negative no

32 1c31703aef if it is any consolation i got my bmi tested ... well so much for being unhappy for about minute negative minute

48 3d9d4b0b55 i donbt like to peel prawns i also dont like g... dont like go negative go

60 52483f7da8 i lost all my friends im alone and sleepyi wan... i lost all my friends im alone and sleepy negative sleepy

... ... ... ... ... ...

27209 30a1e8c2b4 guys i know my ability to read time telli... es faile negative es faile

27280 e77145c41c what happened are you suffering from necksho... i cant move eithe negative eithe

27302 90c8aa60db have i ever told you i absolutly hate writing ... hate wr negative wr

27362 7b82d63ee4 just found out i wont be tweeting from my phon... c sorr negative c sorr

27456 d32efe060f i wanna leave work already not feelin it wanna leave work al negative al

520 rows × 5 columns

FuzzyWuzzy Python library


The fuzzywuzzy library scores the similarity of two strings on a scale from 0 to 100, where 100 means the strings are identical.

https://www.geeksforgeeks.org/fuzzywuzzy-python-library/

fuzz.ratio('geeksforgeeks','geeksgeeks')

87

fuzz.ratio('GeeksforGeeks','GeeksforGeeks')

100
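For readers without fuzzywuzzy installed, a similar 0 to 100 score can be approximated with the standard library's `difflib.SequenceMatcher`. This is a stand-in rather than the library's own code: `simple_ratio` is a hypothetical helper, and its scores may differ from `fuzz.ratio` for some inputs.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

print(simple_ratio('geeksforgeeks', 'geeksgeeks'))
print(simple_ratio('GeeksforGeeks', 'GeeksforGeeks'))
```

The score is twice the number of matched characters divided by the combined length of both strings, so short strings that share only a few characters can still score fairly high.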

def matching(x):
    # Replace each mismatched word in selected_text with a similar word from text
    text=x['text']
    selected=x['selected_text']
    spelling=x['spelling']
    text=text.split()
    selected=selected.split()
    spelling=spelling.split()
    for s in spelling:
        for t in text:
            if s in selected:
                if(fuzz.ratio(t,s)>55):
                    index=selected.index(s)
                    selected[index]=t
    return " ".join(selected)
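The repair step can be tried on a single row without pandas or fuzzywuzzy. The sketch below is a minimal stand-alone version under stated assumptions: `repair_selected` and `simple_ratio` are hypothetical names, and the `difflib`-based score is a substitute for `fuzz.ratio`, so individual scores may not match the library exactly.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    # Stand-in for fuzz.ratio (assumption): 0-100 similarity via difflib.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def repair_selected(text, selected, spelling, threshold=55):
    """Replace each mismatched word in `selected` with a similar word from `text`."""
    text_words = text.split()
    selected_words = selected.split()
    for s in spelling.split():
        for t in text_words:
            # Once s has been replaced it is no longer in selected_words,
            # so only the first sufficiently similar text word is used.
            if s in selected_words and simple_ratio(t, s) > threshold:
                selected_words[selected_words.index(s)] = t
    return " ".join(selected_words)

print(repair_selected("a little happy for the wine", "a little happy fo", "fo"))
```

Because the first text word above the threshold wins, a lower threshold can pick an early, loosely similar word rather than the closest one.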

train['selected_text']=train[['text','selected_text','spelling']].apply(lambda x: matching(x) if x['spelling']!=

train['spelling']=train.apply(lambda x: wrong_words(x.text,x.selected_text),axis=1)

train.loc[(train['spelling']!='++++') & (train['sentiment']=='positive')]


textID text selected_text sentiment spelling

349 322b61740c degrees gross skies and thunderstormsperfect... perfect match positive perfect

362 b94aaf845e please review sunehre ad placement please re positive re

1077 3d5c1ed21b up is out i didnt get the memo it looks am... o it looks amazing positive o

1363 4eec486ad7 hey i loved acs but i had to see it online is... y i loved acs but i had to see it online is no... positive y

1447 2b6bd43b14 oh ok well good for youcan i get some weather... l good for youcan positive l

... ... ... ... ... ...

24169 cc58093a1b i know our cats could be family mikesh is so... sh is so cute positive sh

24594 fcb4695f92 have you check oceanupmiley cyrus justin gast... mim with you whatever happen positive mim

25059 43c9e2ebbf i hope you hoes are having so much funnot to... h funnot positive h

25908 fdd49e735a ok time for bed good night twitter good night tw positive tw

26883 ee30bf8953 im yours jason mrazlooking for an electric gu... yay positive yay

90 rows × 5 columns

train.loc[(train['spelling']!='++++') & (train['sentiment']=='negative')]

textID text selected_text sentiment spelling

27 bdc32ea43c on the way to malaysiano internet access to twit no internet negative no

223 242d92151a walking to class i hate not having a bikeespec... i hate not having a bike negative bike

569 03f9f6f798 i dont think ive ever been so tierd in my life... i dont think ive ever been so tierd in my lifeu negative lifeu

604 15f47296e2 soooim kinda o sick n tired of the bs that guy... im kinda o sick n tired negative im

863 19d585c61b my poor heather she didnt make the cheerleadin... sorry ba negative ba

... ... ... ... ... ...

26248 7d4f718542 gates am also guttedthe end is nigh x the end is nigh negative the

27209 30a1e8c2b4 guys i know my ability to read time telli... es failed negative es

27302 90c8aa60db have i ever told you i absolutly hate writing ... hate wr negative wr

27362 7b82d63ee4 just found out i wont be tweeting from my phon... c sorry negative c

27456 d32efe060f i wanna leave work already not feelin it wanna leave work al negative al

93 rows × 5 columns

train[train['spelling'].apply(lambda x: len(x))==1]
textID text selected_text sentiment spelling

1077 3d5c1ed21b up is out i didnt get the memo it looks am... o it looks amazing positive o

1363 4eec486ad7 hey i loved acs but i had to see it online is... y i loved acs but i had to see it online is no... positive y

1447 2b6bd43b14 oh ok well good for youcan i get some weather... l good for youcan positive l

3720 4034294a60 followfriday follow these ppl they are in... l they are interesting doesnt tweet much they positive l

4764 43e6d9aeaa well i guess they think of everything thank... g think positive g

6939 a50c8dc573 o its feels like a hot box and no matter wher... s like a hot box a negative s

6948 7b25c09b0f im sure lots of that studio equipment was col... y collected negative y

9063 3e775363a1 we cant even call you from belgium sucks m sucks negative m

9539 825b22b853 wait and electrik red or richgirl im a suck... l im a sucker for the wait negative l

11431 e158424933 nope san leandro marina how are you hope y... u hope youre well positive u

13445 af0ac6b470 presentation went well yes i also met a buch... h of cool people checked your portfolio nice w... positive h

14571 7abcaab000 im out of town next week well have to party... k happy positive k

17627 e040d9f166 what did i do to you sheesh u sheesh negative u

21495 0d0959690c were you able to watch it online i hope you ... h belinda jensen was really good positive h

22928 3c68f6963c lol on blipfm is not the on twitter i hate ... r i hate that on negative r

23415 0dac65e546 arnold california aka the best place everwis... a the best place everwish positive a

24002 c62590abcd shiv his place slowly i hope y i hope positive y

25059 43c9e2ebbf i hope you hoes are having so much funnot to... h funnot positive h

26241 2dff366260 going to shawneei hate the long drive therei j... i hate the long drive the negative i

27362 7b82d63ee4 just found out i wont be tweeting from my phon... c sorry negative c

train['selected_text']=train[['selected_text','spelling']].apply(lambda x: remove_text(x) if len(x['spelling'])==

train['spelling']=train.apply(lambda x: wrong_words(x.text,x.selected_text),axis=1)

train[train['spelling'].apply(lambda x: len(x))==1]

textID text selected_text sentiment spelling

train.loc[(train['spelling']!='++++') & (train['sentiment']=='positive')]

textID text selected_text sentiment spelling

349 322b61740c degrees gross skies and thunderstormsperfect... perfect match positive perfect

362 b94aaf845e please review sunehre ad placement please re positive re

1531 a0ee798944 waiting to go to bed had a great weekend great we positive we

1580 8a159382ea just woke up gonna have a shower and go to nan... happy mo positive mo

1588 a7f72a928a woooooooooo are you coming to nottingham at... to lovelovelove positive lovelovelove

... ... ... ... ... ...

24123 4591bce14e think you should catch up on your sleep befor... haha goodnight positive goodnight

24169 cc58093a1b i know our cats could be family mikesh is so... sh is so cute positive sh

24594 fcb4695f92 have you check oceanupmiley cyrus justin gast... mim with you whatever happen positive mim

25908 fdd49e735a ok time for bed good night twitter good night tw positive tw

26883 ee30bf8953 im yours jason mrazlooking for an electric gu... yay positive yay

78 rows × 5 columns

train.loc[(train['spelling']!='++++') & (train['sentiment']=='negative')]


textID text selected_text sentiment spelling

27 bdc32ea43c on the way to malaysiano internet access to twit no internet negative no

223 242d92151a walking to class i hate not having a bikeespec... i hate not having a bike negative bike

569 03f9f6f798 i dont think ive ever been so tierd in my life... i dont think ive ever been so tierd in my lifeu negative lifeu

604 15f47296e2 soooim kinda o sick n tired of the bs that guy... im kinda o sick n tired negative im

863 19d585c61b my poor heather she didnt make the cheerleadin... sorry ba negative ba

... ... ... ... ... ...

26088 69a24b165f taking willy to the specialistpoor dog he has ... poor dog he negative poor

26248 7d4f718542 gates am also guttedthe end is nigh x the end is nigh negative the

27209 30a1e8c2b4 guys i know my ability to read time telli... es failed negative es

27302 90c8aa60db have i ever told you i absolutly hate writing ... hate wr negative wr

27456 d32efe060f i wanna leave work already not feelin it wanna leave work al negative al

85 rows × 5 columns

def matching(x):
    # Same repair as before, with the similarity threshold lowered from 55 to 35
    text=x['text']
    selected=x['selected_text']
    spelling=x['spelling']
    text=text.split()
    selected=selected.split()
    spelling=spelling.split()
    for s in spelling:
        for t in text:
            if s in selected:
                if(fuzz.ratio(t,s)>35):
                    index=selected.index(s)
                    selected[index]=t
    return " ".join(selected)

train['selected_text']=train[['text','selected_text','spelling']].apply(lambda x: matching(x) if x['spelling']!=

train['spelling']=train.apply(lambda x: wrong_words(x.text,x.selected_text),axis=1)

train.loc[(train['spelling']!='++++') & (train['sentiment']=='positive')]

textID text selected_text sentiment spelling

1588 a7f72a928a woooooooooo are you coming to nottingham at... to lovelovelove positive lovelovelove

7410 3463ecdfd6 imintheroom imwatchingthehannahmoviewithmomshe... great positive great

10521 f29edbc282 dora the explorer greetings to your niece enjoy positive enjoy

train.loc[7410].text

'imintheroom imwatchingthehannahmoviewithmomshesaidthisfilmverygreat'

train.loc[7410,'text']='im in the room im watching the hannah movie with mom she said this film very great'

train.loc[1588].text

' woooooooooo are you coming to nottingham at any point '

train.loc[1588,'selected_text']='woooooooooo'

train.loc[10521].text

' dora the explorer greetings to your niece'

train.loc[10521,'selected_text']='greetings'

train.loc[(train['spelling']!='++++') & (train['sentiment']=='negative')]


textID text selected_text sentiment spelling

2398 983dfecd25 gonna do laundrynever did laundry a hotel bef... did miss you r negative r

6113 2cb67e64b4 these dogs are going to die if somebody doe... aam these dogs are going to die if somebody do... negative aam

9817 3358792fc9 following and followers nice not nice negative not

13637 d83fd6c942 tweeets fgs tweekdeckkk hates me cryyyy kk hates me cryyyy negative kk

14839 b19376c3bd just got back fromahem boring but had to eat... was boring but had to eat nonetheless negative was

16201 e78c1ad3f5 off to work off at lammmeeee negative lammmeeee

25293 2fdbe40c03 grreverytime he gets a new girlfriendim at the... im at the bottom of the totem pole negative im

train.loc[2398].text

'gonna do laundrynever did laundry a hotel beforei miss you reven though you ignore me and don even check on me'

train.loc[2398,'selected_text']='did miss you'

train.loc[6113].text

' these dogs are going to die if somebody doesnt save them'

train.loc[6113,'selected_text']='these dogs are going to die if somebody doesnt save them'

train.loc[9817].text

' following and followers nice'

train.loc[9817,'text']='following and followers not nice'

train.loc[13637].text

' tweeets fgs tweekdeckkk hates me cryyyy'

train.loc[13637,'selected_text']='hates me cryyyy'

train.loc[14839].text

'just got back fromahem boring but had to eat nonetheless'

train.loc[14839,'selected_text']='boring but had to eat nonetheless'

train.loc[16201].text

'off to work off at '

train.loc[16201,'selected_text']='off to work'

train.loc[25293].text

'grreverytime he gets a new girlfriendim at the bottom of the totem pole'

train.loc[25293,'selected_text']='at the bottom of the totem pole'

train['spelling']=train.apply(lambda x: wrong_words(x.text,x.selected_text),axis=1)

train[train['spelling'].apply(lambda x: len(x))==1]

textID text selected_text sentiment spelling

train.loc[(train['spelling']!='++++') & (train['sentiment']=='positive')]

textID text selected_text sentiment spelling

train.loc[(train['spelling']!='++++') & (train['sentiment']=='negative')]

textID text selected_text sentiment spelling

train.reset_index(inplace=True)
train.drop(['index'],inplace=True,axis=1)

train

textID text selected_text sentiment spelling

0 cb774db0d1 id have responded if i were going id have responded if i were going neutral ++++

1 549e992a42 sooo sad i will miss you here in san diego sooo sad negative ++++

2 088c60f138 my boss is bullying me bullying me negative ++++

3 9642c003ef what interview leave me alone leave me alone negative ++++

4 358bd9e861 sons of ABUSE why couldnt they put them on th... sons of ABUSE negative ++++

... ... ... ... ... ...

27467 4eac33d1c0 wish we could come see u on denver husband l... lost negative ++++

27468 4f4c4fc327 ive wondered about rake to the client has ma... dont force negative ++++

27469 f67aae2310 yay good for both of you enjoy the break you... yay good for both of you positive ++++

27470 ed167662a5 but it was worth it ABUSE but it was worth it ABUSE positive ++++

27471 6f7127d9d7 all this flirting going on the atg smiles ... all this flirting going on the atg smiles yay... neutral ++++

27472 rows × 5 columns

train.loc[8727]

textID 12f21c8f19
text star wars is ABUSE boo i wanna do your job h...
selected_text
sentiment positive
spelling ++++
Name: 8727, dtype: object

train.loc[25996]

textID 0b3fe0ca78
text
selected_text
sentiment neutral
spelling ++++
Name: 25996, dtype: object

train.drop(8727,inplace=True)

train.drop(25996,inplace=True)

train.reset_index(inplace=True)

train

index textID text selected_text sentiment spelling

0 0 cb774db0d1 id have responded if i were going id have responded if i were going neutral ++++

1 1 549e992a42 sooo sad i will miss you here in san diego sooo sad negative ++++

2 2 088c60f138 my boss is bullying me bullying me negative ++++

3 3 9642c003ef what interview leave me alone leave me alone negative ++++

4 4 358bd9e861 sons of ABUSE why couldnt they put them on th... sons of ABUSE negative ++++

... ... ... ... ... ... ...

27465 27467 4eac33d1c0 wish we could come see u on denver husband l... lost negative ++++

27466 27468 4f4c4fc327 ive wondered about rake to the client has ma... dont force negative ++++

27467 27469 f67aae2310 yay good for both of you enjoy the break you... yay good for both of you positive ++++

27468 27470 ed167662a5 but it was worth it ABUSE but it was worth it ABUSE positive ++++

27469 27471 6f7127d9d7 all this flirting going on the atg smiles ... all this flirting going on the atg smiles yay... neutral ++++

27470 rows × 6 columns

train.drop(['index'],inplace=True,axis=1)

train

textID text selected_text sentiment spelling

0 cb774db0d1 id have responded if i were going id have responded if i were going neutral ++++

1 549e992a42 sooo sad i will miss you here in san diego sooo sad negative ++++

2 088c60f138 my boss is bullying me bullying me negative ++++

3 9642c003ef what interview leave me alone leave me alone negative ++++

4 358bd9e861 sons of ABUSE why couldnt they put them on th... sons of ABUSE negative ++++

... ... ... ... ... ...

27465 4eac33d1c0 wish we could come see u on denver husband l... lost negative ++++

27466 4f4c4fc327 ive wondered about rake to the client has ma... dont force negative ++++

27467 f67aae2310 yay good for both of you enjoy the break you... yay good for both of you positive ++++

27468 ed167662a5 but it was worth it ABUSE but it was worth it ABUSE positive ++++

27469 6f7127d9d7 all this flirting going on the atg smiles ... all this flirting going on the atg smiles yay... neutral ++++

27470 rows × 5 columns

with open('/content/drive//My Drive/Tweet Sentiment Extraction/preprocessed_train.pkl','wb') as f:
    pickle.dump(train,f)

with open('/content/drive//My Drive/Tweet Sentiment Extraction/preprocessed_test.pkl','wb') as f:
    pickle.dump(test,f)

with open('/content/drive//My Drive/Tweet Sentiment Extraction/preprocessed_train.pkl','rb') as f:
    train=pickle.load(f)

#https://www.w3schools.com/python/ref_string_index.asp
def start_index(x):
    # Word position in text of the first word of selected_text
    text=x['text']
    selected=x['selected_text']
    text=text.split()
    selected=selected.split()
    word=selected[0]
    index=text.index(word)
    return index

train['start_index']=train[['text','selected_text']].apply(lambda x: start_index(x),axis=1)

#https://www.w3schools.com/python/ref_string_index.asp
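def end_index(x):
    # Word position in text of the last word of selected_text
    text=x['text']
    selected=x['selected_text']
    start=x['start_index']
    text=text.split()
    selected= selected.split()
    word=selected[-1]
    try:
        index=text.index(word,start) # search from the start position onwards
    except ValueError:
        index=text.index(word) # fall back to the first occurrence anywhere
    return index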

train['end_index']=train[['text','selected_text','start_index']].apply(lambda x: end_index(x),axis=1)
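The two index functions can be checked together on plain strings. The sketch below merges them into one hypothetical helper, `span_indices`, which is not part of the notebook. The second input shows how the fallback search can yield a start index greater than the end index, which is why such rows are filtered out afterwards.

```python
def span_indices(text, selected):
    """Word-level (start, end) positions of `selected` inside `text`."""
    text_words = text.split()
    selected_words = selected.split()
    start = text_words.index(selected_words[0])
    try:
        # Prefer the first occurrence of the last word at or after `start`.
        end = text_words.index(selected_words[-1], start)
    except ValueError:
        # Fall back to the first occurrence anywhere, which may precede `start`.
        end = text_words.index(selected_words[-1])
    return start, end

print(span_indices("my boss is bullying me", "bullying me"))
print(span_indices("happy hug your mom day love you mom", "love your"))
```

Both indices are word positions rather than character offsets, so they are only meaningful against the whitespace-split version of the cleaned text.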

train
textID text selected_text sentiment spelling start_index end_index

0 cb774db0d1 id have responded if i were going id have responded if i were going neutral ++++ 0 6

1 549e992a42 sooo sad i will miss you here in san diego sooo sad negative ++++ 0 1

2 088c60f138 my boss is bullying me bullying me negative ++++ 3 4

3 9642c003ef what interview leave me alone leave me alone negative ++++ 2 4

4 358bd9e861 sons of ABUSE why couldnt they put them on th... sons of ABUSE negative ++++ 0 2

... ... ... ... ... ... ... ...

27465 4eac33d1c0 wish we could come see u on denver husband l... lost negative ++++ 9 9

27466 4f4c4fc327 ive wondered about rake to the client has ma... dont force negative ++++ 13 14

27467 f67aae2310 yay good for both of you enjoy the break you... yay good for both of you positive ++++ 0 5

27468 ed167662a5 but it was worth it ABUSE but it was worth it ABUSE positive ++++ 0 5

27469 6f7127d9d7 all this flirting going on the atg smiles ... all this flirting going on the atg smiles yay... neutral ++++ 0 9

27470 rows × 7 columns

train[train.start_index > train.end_index]

textID text selected_text sentiment spelling start_index end_index

708 48096285e5 yeah im okay been icing and ace bandage and ... lol and positive ++++ 15 5

1104 5419aaf31e packing up and leaving inlaws house heading ho... nice sweet positive ++++ 12 8

1279 adac9ee2e1 i just started watching rock too borrowed s... so too positive ++++ 12 5

1361 4eec486ad7 hey i loved acs but i had to see it online is... i loved acs but i had to see it online is not ... positive ++++ 1 0

1374 eadba04b8a that would mean me babe but ABUSE it my name... i cool positive ++++ 14 13

... ... ... ... ... ... ... ...

25276 2e3afc2394 gutted i miss that the one night i try lea... were the negative ++++ 14 4

25997 6417912691 happy hug your mom day love you mom love your positive ++++ 5 2

26150 e137ff5246 i know i shouldnt be saying this but ABUSE iti... but ABUSE i negative ++++ 7 0

26347 aa3b4edd5e waiting for my momma so i can go to chase and ... miss what negative ++++ 21 12

26522 2fd0de39b2 oh the noon i dont know if i can make that on... i dont know if i can make that noon negative ++++ 3 2

89 rows × 7 columns

train=train[train.start_index <= train.end_index]

train[train.start_index > train.end_index]

textID text selected_text sentiment spelling start_index end_index

train.reset_index(inplace=True)

train.drop(['index'],inplace=True,axis=1)

train.shape

(27381, 7)

train
textID text selected_text sentiment spelling start_index end_index

0 cb774db0d1 id have responded if i were going id have responded if i were going neutral ++++ 0 6

1 549e992a42 sooo sad i will miss you here in san diego sooo sad negative ++++ 0 1

2 088c60f138 my boss is bullying me bullying me negative ++++ 3 4

3 9642c003ef what interview leave me alone leave me alone negative ++++ 2 4

4 358bd9e861 sons of ABUSE why couldnt they put them on th... sons of ABUSE negative ++++ 0 2

... ... ... ... ... ... ... ...

27376 4eac33d1c0 wish we could come see u on denver husband l... lost negative ++++ 9 9

27377 4f4c4fc327 ive wondered about rake to the client has ma... dont force negative ++++ 13 14

27378 f67aae2310 yay good for both of you enjoy the break you... yay good for both of you positive ++++ 0 5

27379 ed167662a5 but it was worth it ABUSE but it was worth it ABUSE positive ++++ 0 5

27380 6f7127d9d7 all this flirting going on the atg smiles ... all this flirting going on the atg smiles yay... neutral ++++ 0 9

27381 rows × 7 columns

with open('/content/drive//My Drive/Tweet Sentiment Extraction/updated_train.pkl','wb') as f:
    pickle.dump(train,f)