Section 6 - Jupyter Notebook
Out[13]:
review sentiment
In [14]: df.info()
# Shows summary information about the DataFrame
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review 2000 non-null object
1 sentiment 2000 non-null object
dtypes: object(2)
memory usage: 31.4+ KB
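In [15]: df.describe().T
# Returns summary statistics for each column.
# count: the total number of non-null values in the column.
# For numeric columns it also reports the mean, standard deviation,
# minimum, maximum, and the first, second, and third quartiles.
# Both columns here are text (object), so the output shows count, unique, top, and freq instead.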
Out[15]:
count unique top freq
review 2000 2000 One of the other reviewers has mentioned that ... 1
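For text (object) columns, describe() reports count, unique, top, and freq rather than a mean or quartiles. A minimal sketch on a made-up three-row frame; the `toy` data is illustrative and not taken from the dataset:

```python
import pandas as pd

# Hypothetical toy frame with the same two object columns as df
toy = pd.DataFrame({
    "review": ["great film", "dull plot", "great film"],
    "sentiment": ["positive", "negative", "positive"],
})

# Transposed summary: one row per column of the frame
summary = toy.describe().T
print(summary)
```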
Out[16]: sentiment
positive 1005
negative 995
Name: count, dtype: int64
In [10]: df['review'].str.len().hist()
# Plots a histogram of review lengths (number of characters) in the column
Out[18]:
text sentiment
1996 THE CELL (2000) Rating: 8/10<br /><br />The Ce... positive
1998 I loved this movie! It was all I could do not ... positive
1999 This was the worst movie I have ever seen Bill... negative
import re

def cleaning(text):
    # earlier cleaning steps, if any, are not shown in this section
    # removing emoji:
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # enclosed characters (this wide range also covers CJK text)
        "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    # expanding contractions (straight apostrophe):
    text = re.sub("isn't", 'is not', text)
    text = re.sub("he's", 'he is', text)
    text = re.sub("wasn't", 'was not', text)
    text = re.sub("there's", 'there is', text)
    text = re.sub("couldn't", 'could not', text)
    text = re.sub("won't", 'will not', text)
    text = re.sub("they're", 'they are', text)
    text = re.sub("she's", 'she is', text)
    text = re.sub("There's", 'there is', text)
    text = re.sub("wouldn't", 'would not', text)
    text = re.sub("haven't", 'have not', text)
    text = re.sub("That's", 'That is', text)
    text = re.sub("you've", 'you have', text)
    text = re.sub("He's", 'He is', text)
    text = re.sub("what's", 'what is', text)
    text = re.sub("weren't", 'were not', text)
    text = re.sub("we're", 'we are', text)
    text = re.sub("hasn't", 'has not', text)
    text = re.sub("you'd", 'you would', text)
    text = re.sub("shouldn't", 'should not', text)
    text = re.sub("let's", 'let us', text)
    text = re.sub("they've", 'they have', text)
    text = re.sub("You'll", 'You will', text)
    text = re.sub("i'm", 'i am', text)
    text = re.sub("we've", 'we have', text)
    text = re.sub("it's", 'it is', text)
    text = re.sub("don't", 'do not', text)
    # same contractions written with an acute accent or a curly apostrophe:
    text = re.sub("that´s", 'that is', text)
    text = re.sub("I´m", 'I am', text)
    text = re.sub("it’s", 'it is', text)
    text = re.sub("she´s", 'she is', text)
    text = re.sub("he’s", 'he is', text)
    text = re.sub('I’m', 'I am', text)
    text = re.sub('I’d', 'I would', text)
    text = re.sub('there’s', 'there is', text)
    return text
dt = df['text'].apply(cleaning)
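The long chain of re.sub calls can also be written as a lookup table applied in one pass. This is only a sketch: `CONTRACTIONS` holds an illustrative subset of the pairs above, and `expand_contractions` is a hypothetical helper name, not part of the notebook.

```python
import re

# Illustrative subset of the contraction pairs used in the cleaning step
CONTRACTIONS = {
    "isn't": "is not",
    "won't": "will not",
    "it's": "it is",
    "they're": "they are",
}

# One alternation, longest keys first so longer contractions are tried first
_pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
)

def expand_contractions(text):
    """Replace every known contraction in a single regex pass (case-sensitive)."""
    # Normalise curly (U+2019) and acute (U+00B4) apostrophes to a plain one first
    text = text.replace("\u2019", "'").replace("\u00b4", "'")
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)

result = expand_contractions("it\u2019s fine but they're bored and it won't end")
print(result)
```

Normalising the apostrophe variants up front means each contraction needs only one entry in the table.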
In [16]: df['sentiment']
# Retrieves the column named "sentiment" from the DataFrame.
# The column holds the sentiment associated with each review, either positive or negative.
Out[16]: 0 positive
1 positive
2 positive
3 negative
4 positive
...
1995 negative
1996 positive
1997 negative
1998 positive
1999 negative
Name: sentiment, Length: 2000, dtype: object
In [20]: dt = pd.DataFrame(dt)
dt['sentiment']=df['sentiment']
dt
Out[20]:
text sentiment
1996 the cell rating the cell like antz must be wa... positive
1997 this movie despite its list of b c and d list ... negative
1998 i loved this movie it was all i could do not t... positive
1999 this was the worst movie i have ever seen bill... negative
In [19]: dt
Out[19]:
text sentiment no_sw
0 one of the other reviewers has mentioned that ... positive reviewers mentioned watching oz episode youll ...
1 a wonderful little production the filming tech... positive wonderful production filming technique unassum...
2 i thought this was a wonderful way to spend ti... positive wonderful spend time hot summer weekend sittin...
3 basically theres a family where a little boy j... negative basically family boy jake thinks zombie closet...
4 petter matteis love in the time of money is a ... positive petter matteis love time money visually stunni...
1995 feeling minnesota directed by steven baigelman... negative feeling minnesota directed steven baigelmann s...
1996 the cell rating the cell like antz must be wa... positive cell rating cell antz watched appreciated time...
1997 this movie despite its list of b c and d list ... negative movie despite list b list celebs complete wast...
1998 i loved this movie it was all i could do not t... positive loved movie break tears watching uplifting str...
1999 this was the worst movie i have ever seen bill... negative worst movie billy zane understand movie showca...
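The cell that builds the no_sw column is not shown. A minimal sketch of stop-word removal, assuming whitespace tokenisation; `STOP_WORDS` is a tiny illustrative set and `remove_stop_words` is a hypothetical helper name:

```python
# Tiny illustrative stop-word set; a real list (for example NLTK's) is much larger
STOP_WORDS = {"one", "of", "the", "other", "has", "that", "a", "this", "was", "i", "to"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)

no_sw_example = remove_stop_words("one of the other reviewers has mentioned that")
print(no_sw_example)
```

On a DataFrame this could be applied as dt['text'].apply(remove_stop_words), assuming the text has already been lowercased.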
Out[22]:
word count
0 movie 3376
1 film 2889
2 story 862
3 time 826
4 movies 667
5 great 661
6 made 616
7 make 600
8 characters 586
9 films 566
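A word/count table like the one above can be produced with collections.Counter. The sketch below uses three made-up strings, not the dataset, and the variable names are illustrative:

```python
from collections import Counter

# Made-up cleaned texts standing in for the no_sw column
texts = ["loved movie great story", "worst movie", "great film great cast"]

# Count every whitespace-separated token across all texts
counts = Counter(word for text in texts for word in text.split())

# The two most frequent words with their counts
top = counts.most_common(2)
print(top)
```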
In [23]: px.bar(temp, x="count", y="word", title='Common Words in Text', orientation='h',
width=700, height=700)
# After this code runs, a bar chart shows the most frequent words in the cleaned text,
# with each word's count on the horizontal axis and the word itself on the vertical axis.
[Figure: horizontal bar chart of the ten most common words; x-axis: count, y-axis: word]
Out[24]:
text sentiment no_sw wo_stopfreq
0 one of the other reviewers has mentioned that ... positive reviewers mentioned watching oz episode youll ... reviewers mentioned watching oz episode youll ...
2 i thought this was a wonderful way to spend ti... positive wonderful spend time hot summer weekend sittin... wonderful spend hot summer weekend sitting air...
3 basically theres a family where a little boy j... negative basically family boy jake thinks zombie closet... basically family boy jake thinks zombie closet...
4 petter matteis love in the time of money is a ... positive petter matteis love time money visually stunni... petter matteis love money visually stunning wa...
In [25]: dt['no_sw'].loc[5]
# Shows the text with stop words removed for row 5 of the "no_sw" column of DataFrame dt
Out[25]: 'probably alltime favorite movie story selflessness sacrifice dedication noble preachy boring despite times years paul lukas performance brings tears eyes bette davis sympathetic roles delight kids grandma dressedup midgets children makes fun watch mothers slow awakening whats happening world roof believable startling dozen thumbs theyd movie'
In [26]: dt['wo_stopfreq'].loc[5]
# Shows the text with both stop words and frequent words removed for row 5 of the "wo_stopfreq" column of DataFrame dt
Out[26]: 'probably alltime favorite selflessness sacrifice dedication noble preachy boring despite times years paul lukas performance brings tears eyes bette davis sympathetic roles delight kids grandma dressedup midgets children makes fun watch mothers slow awakening whats happening world roof believable startling dozen thumbs theyd'
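Comparing Out[25] with Out[26], words such as "movie" and "story" are gone, which matches the top of the Out[22] frequency table. The cell that does this is not shown, so the following is only a sketch of one way to drop the most frequent words; the helper name `remove_frequent_words` and the cut-off `n` are assumptions:

```python
from collections import Counter

def remove_frequent_words(texts, n=10):
    """Drop the n most frequent words (corpus-wide) from every text."""
    counts = Counter(word for text in texts for word in text.split())
    frequent = {word for word, _ in counts.most_common(n)}
    cleaned = [
        " ".join(word for word in text.split() if word not in frequent)
        for text in texts
    ]
    return cleaned, frequent

# Made-up texts standing in for the no_sw column
texts = ["favorite movie story sacrifice", "worst movie", "movie story twist"]
cleaned, frequent = remove_frequent_words(texts, n=2)
print(frequent)
print(cleaned)
```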
In [27]: # Lemmatization converts an inflected word to its base form, or lemma, by removing affixes.
# It helps create better features for machine learning and NLP models, so it is an important preprocessing step.
wordnet_lem = WordNetLemmatizer()
dt['wo_stopfreq_lem'] = dt['wo_stopfreq'].apply(wordnet_lem.lemmatize)
dt
# wordnet_lem = WordNetLemmatizer() initializes the lemmatizer.
# dt['wo_stopfreq'].apply(wordnet_lem.lemmatize) calls lemmatize on each text in the "wo_stopfreq" column of DataFrame dt.
# The results are stored in a new column named "wo_stopfreq_lem" in DataFrame dt.
# Note: lemmatize() expects a single word, so passing a whole review string generally returns it unchanged,
# which is consistent with the identical last two columns in Out[27].
# Lemmatization is useful for improving text features and reducing variation between similar words,
# which helps improve the performance of machine learning and natural language processing models.
Out[27]:
text sentiment no_sw wo_stopfreq wo_stopfreq_lem
0 one of the other reviewers has mentioned that ... positive reviewers mentioned watching oz episode youll ... reviewers mentioned watching oz episode youll ... reviewers mentioned watching oz episode youll ...
1 a wonderful little production the filming tech... positive wonderful production filming technique unassum... wonderful production filming technique unassum... wonderful production filming technique unassum...
2 i thought this was a wonderful way to spend ti... positive wonderful spend time hot summer weekend sittin... wonderful spend hot summer weekend sitting air... wonderful spend hot summer weekend sitting air...
3 basically theres a family where a little boy j... negative basically family boy jake thinks zombie closet... basically family boy jake thinks zombie closet... basically family boy jake thinks zombie closet...
4 petter matteis love in the time of money is a ... positive petter matteis love time money visually stunni... petter matteis love money visually stunning wa... petter matteis love money visually stunning wa...
1995 feeling minnesota directed by steven baigelman... negative feeling minnesota directed steven baigelmann s... feeling minnesota directed steven baigelmann s... feeling minnesota directed steven baigelmann s...
1996 the cell rating the cell like antz must be wa... positive cell rating cell antz watched appreciated time... cell rating cell antz watched appreciated medi... cell rating cell antz watched appreciated medi...
1997 this movie despite its list of b c and d list ... negative movie despite list b list celebs complete wast... despite list b list celebs complete waste minu... despite list b list celebs complete waste minu...
1998 i loved this movie it was all i could do not t... positive loved movie break tears watching uplifting str... loved break tears watching uplifting struck pe... loved break tears watching uplifting struck pe...
1999 this was the worst movie i have ever seen bill... negative worst movie billy zane understand movie showca... worst billy zane understand showcase comers pr... worst billy zane understand showcase comers pr...
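WordNetLemmatizer.lemmatize takes one word at a time, so one way to lemmatize a whole review is to split it into tokens first. A minimal sketch of that idea: `lemmatize_text` is a hypothetical helper, and `toy_lemmatize` is a crude stand-in used only so the example runs without NLTK data.

```python
def lemmatize_text(text, lemmatize_word):
    """Apply a single-word lemmatizer to every whitespace-separated token."""
    return " ".join(lemmatize_word(tok) for tok in text.split())

def toy_lemmatize(word):
    """Hypothetical stand-in: strips a trailing 's' from words longer than three letters."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

lemmatized = lemmatize_text("reviewers mentioned episodes", toy_lemmatize)
print(lemmatized)
```

With NLTK, wordnet_lem.lemmatize could be passed in place of toy_lemmatize, for example dt['wo_stopfreq'].apply(lambda s: lemmatize_text(s, wordnet_lem.lemmatize)), assuming the WordNet corpus has been downloaded.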
Out[28]:
sentiment review
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[1], line 12
8 cv = CountVectorizer(stop_words='english',ngram_range = (1,1),tokenizer = token.tokenize)
9 # Creates a CountVectorizer instance with a few options set, such as:
10 # stop_words='english': removes English stop words
11 # tokenizer: uses the previously defined tokenizer to split the text
---> 12 text_counts = cv.fit_transform(nb['review'])
In [3]: from sklearn.metrics import plot_confusion_matrix
import warnings
warnings.filterwarnings("ignore")
k= [CNB]
for i in k:
    plot_confusion_matrix(i, X_test, y_test)
    plt.title(i)
    plt.show()
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
Cell In[3], line 1
----> 1 from sklearn.metrics import plot_confusion_matrix
2 import warnings
3 warnings.filterwarnings("ignore")