0% found this document useful (0 votes)

46 views3 pages

Python Assignment 3

This document discusses text preprocessing and feature extraction techniques in natural language processing. It loads a corpus of text documents, applies bag-of-words and TF-IDF algorithms to extract features and calculate feature weights. Specifically, it uses sklearn's CountVectorizer and TfidfVectorizer to transform text into numerical vectors and calculate IDF weights. It then prints the extracted features and computed IDF values. Additionally, it implements a custom IDF calculation for comparison purposes.

Uploaded by

Bataan Shivani

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

46 views3 pages

Python Assignment 3

Uploaded by

Bataan Shivani

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 3

In [68]: #TASK 1

#import all required libraray import pandas as pd import math import numpy
as np from scipy import sparse from scipy.stats import uniform from
sklearn.feature_extraction.text import TfidfVectorizer

# input data string corpus = ['this is the first document',

'this document is the second document', 'and this is the third one', 'is
this the first document'] # use fit method to compute Bag of words

vectorizer = TfidfVectorizer() vectorizer.fit(corpus) skl_output =

vectorizer.transform(corpus) bow=vectorizer.get_feature_names() print(bow)
IDF_reference=vectorizer.idf_ print(IDF_reference) #compute IDF using
custom method

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer() vectors = vectorizer.fit_transform(corpus)

matrix = CountVectorizer() matrix.fit(corpus) # after this statement the

matrix will build the vocabulary with all th e unique words

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
# you should call this function only after fit()

# to convert the sentance into numerical vectors, we will call transfor

m() # the first feature name will corresponds to first column in
transforme d matrix # the 2nd feature name will corresponds to 2nd column
in transformed ma trix

print(matrix.transform(corpus).toarray())

# Here we will print the sklearn tfidf vectorizer idf values after appl
ying the fit method # After using the fit function on the corpus the vocab
has 9 words in i t, and each has its idf value.

#compute IDF using custom method

for i in range(len(bow)):
Y=0 word=bow[i] for j in range(len(corpus)):
list[j]=corpus[j].split()

if(word in list[j]): #print(word) #print(list[j]) Y=Y+1 X=len(corpus)

XY=math.log((1+X)/(1+Y)) IDF_custom=XY+1 print(IDF_custom)

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'th

is'] [1 91629073 1 22314355 1 51082562 1 1 91629073 1 91629073 Create PDF

in your applications with the Pdfcrowd HTML to PDF API PDFCROWD

[1.91629073 1.22314355 1.51082562 1. 1.91629073 1.91629073
1. 1.91629073 1. ] [[0 1 1 1 0 0 1 0 1] [0 2 0 1 0 1 1 0 1] [1 0 0 1 1 0 1
1 1] [0 1 1 1 0 0 1 0 1]] 1.916290731874155 1.2231435513142097
1.5108256237659907 1.0 1.916290731874155 1.916290731874155 1.0
1.916290731874155 1.0

In [15]: #TASK2
import pickle import numpy as np with
open("E:\Applied_AI\Assignments\cleaned_strings","rb") as f:
data = pickle.load(f) # printing the length of the corpus loaded
print("Number of documents in data = ",len(data))

#call all usique words using fit and tranform function from
sklearn.feature_extraction.text import TfidfVectorizer vectorizer =
TfidfVectorizer() vectorizer.fit(data) skl_output =
vectorizer.transform(data) bow=vectorizer.get_feature_names()

#compute IDF IDF=vectorizer.idf_

#sort IDF in descending order sorted_IDF=np.sort(IDF)

required_IDF=sorted_IDF[::-1]

#print top 50 IDF values

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
print(required_IDF[0:49])

Number of documents in data = 746 [6.922918 6.922918 6.922918 6.922918

6.922918 6.922918 6.922918 6.92291 8 6.922918 6.922918 6.922918 6.922918
6.922918 6.922918 6.922918 6.92291 8 6.922918 6.922918 6.922918 6.922918

6.922918 6.922918 6.922918 6.92291 8 6.922918 6.922918 6.922918 6.922918

6.922918 6.922918 6.922918 6.92291 8 6.922918]

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD

Extra Feature NLP
No ratings yet
Extra Feature NLP
5 pages
Foundations of Python for AI
No ratings yet
Foundations of Python for AI
67 pages
Assignment 3 Instructions
No ratings yet
Assignment 3 Instructions
10 pages
Assign 3
No ratings yet
Assign 3
1 page
NLP Assignment 4 (22bce9560)
No ratings yet
NLP Assignment 4 (22bce9560)
12 pages
Nlp2.ipynb - Colab
No ratings yet
Nlp2.ipynb - Colab
3 pages
EX1
No ratings yet
EX1
6 pages
ML - Lab Manual With Woad File
No ratings yet
ML - Lab Manual With Woad File
12 pages
9 Feature Engineering Text Data
No ratings yet
9 Feature Engineering Text Data
7 pages
NLP Crecord Mid2
No ratings yet
NLP Crecord Mid2
36 pages
Ir Practical Manual 2
No ratings yet
Ir Practical Manual 2
24 pages
Module III
No ratings yet
Module III
42 pages
Python CA 4
No ratings yet
Python CA 4
9 pages
Amazon-Fine-Food-Review - K-Means, Agglomerative & DBSCAN Clustering
No ratings yet
Amazon-Fine-Food-Review - K-Means, Agglomerative & DBSCAN Clustering
79 pages
Assignment - 7: Import Import Import Import
No ratings yet
Assignment - 7: Import Import Import Import
3 pages
Report On - Social Media Research Topic Modeling
No ratings yet
Report On - Social Media Research Topic Modeling
26 pages
IR Practical
No ratings yet
IR Practical
24 pages
01 - Inspect - Pretrained - Model: 0.1 Download Pre-Trained Model Files
No ratings yet
01 - Inspect - Pretrained - Model: 0.1 Download Pre-Trained Model Files
8 pages
DM Practical File
No ratings yet
DM Practical File
21 pages
DeekshikaJadyada26 AP24LDS11
No ratings yet
DeekshikaJadyada26 AP24LDS11
7 pages
NLP Record 2
No ratings yet
NLP Record 2
18 pages
Apply SVM To Amazon Reviews Data Set Avg W2vec (M)
No ratings yet
Apply SVM To Amazon Reviews Data Set Avg W2vec (M)
8 pages
Information Retrival
No ratings yet
Information Retrival
43 pages
Email Spam Classifier
No ratings yet
Email Spam Classifier
22 pages
Dsbda 7
No ratings yet
Dsbda 7
1 page
ML Lab Exercise - 9
No ratings yet
ML Lab Exercise - 9
4 pages
DS 7
No ratings yet
DS 7
3 pages
A Comprehensive Guide To Understand and Implement Text Classification in Python
No ratings yet
A Comprehensive Guide To Understand and Implement Text Classification in Python
34 pages
NLP2 Prasen
No ratings yet
NLP2 Prasen
6 pages
NLP Manual
No ratings yet
NLP Manual
21 pages
Ai Lab Final
No ratings yet
Ai Lab Final
21 pages
Self Evaluation Exercises
No ratings yet
Self Evaluation Exercises
12 pages
NLP Tushar
No ratings yet
NLP Tushar
21 pages
DSBA+Master+Codebook+ +Text+Mining+&+TSF
No ratings yet
DSBA+Master+Codebook+ +Text+Mining+&+TSF
11 pages
TP1 NLP
No ratings yet
TP1 NLP
7 pages
Code
No ratings yet
Code
6 pages
21BEE0103 (Iot 2theory)
No ratings yet
21BEE0103 (Iot 2theory)
7 pages
Allnlp
No ratings yet
Allnlp
15 pages
Prompt
No ratings yet
Prompt
29 pages
MLA TAB Lecture2
No ratings yet
MLA TAB Lecture2
84 pages
NLP Lab
No ratings yet
NLP Lab
18 pages
I041 NLP Assignment5
No ratings yet
I041 NLP Assignment5
12 pages
15CSL76 Students
No ratings yet
15CSL76 Students
18 pages
Import Import As Import As Import: 'Ignore'
No ratings yet
Import Import As Import As Import: 'Ignore'
4 pages
Using Transfer Learning and TensorFlow 2.0 To Classify Different Dog Breeds
No ratings yet
Using Transfer Learning and TensorFlow 2.0 To Classify Different Dog Breeds
16 pages
Sample
No ratings yet
Sample
6 pages
Cyberbullying Code
No ratings yet
Cyberbullying Code
6 pages
Profound Python Libraries
From Everand
Profound Python Libraries
Onder Teker
No ratings yet
Lab 6
No ratings yet
Lab 6
47 pages
HW 5 Q 1
No ratings yet
HW 5 Q 1
22 pages
Assignment
No ratings yet
Assignment
6 pages
2403RES29 - Hemant Choudhary - CS582 - Assignment - 1
No ratings yet
2403RES29 - Hemant Choudhary - CS582 - Assignment - 1
5 pages
Feature Engineering
100% (2)
Feature Engineering
44 pages
Random Forest
No ratings yet
Random Forest
5 pages
Code Text
No ratings yet
Code Text
4 pages
Gen Ai Lab Programs
No ratings yet
Gen Ai Lab Programs
15 pages
Lab5 Example Fall 23
No ratings yet
Lab5 Example Fall 23
4 pages
Numpy Module
No ratings yet
Numpy Module
10 pages
Gena I Short
No ratings yet
Gena I Short
6 pages
Microproject Report
No ratings yet
Microproject Report
23 pages
Seminar Presentation On Mobile VR
No ratings yet
Seminar Presentation On Mobile VR
14 pages
(X) Rust Backend vs. Go Backend in Web Development
No ratings yet
(X) Rust Backend vs. Go Backend in Web Development
7 pages
Files in Folders and Subfolders
No ratings yet
Files in Folders and Subfolders
6 pages
Vcs db2 Agent 802 Aix
No ratings yet
Vcs db2 Agent 802 Aix
76 pages
3 5 7 Semester BTech MTech Odd Regular Supplementary End Semester Exam Timetable Oct 2023 - Draft
No ratings yet
3 5 7 Semester BTech MTech Odd Regular Supplementary End Semester Exam Timetable Oct 2023 - Draft
9 pages
pd100 PDF
No ratings yet
pd100 PDF
88 pages
AC10 4.Xx Firmware Update Process
No ratings yet
AC10 4.Xx Firmware Update Process
5 pages
Computer Projrct.
No ratings yet
Computer Projrct.
18 pages
Built A Ping Pong Game Using C#
No ratings yet
Built A Ping Pong Game Using C#
21 pages
CHAPTER 3 - 4-Flags and Data Processing Instructions - 3
No ratings yet
CHAPTER 3 - 4-Flags and Data Processing Instructions - 3
85 pages
Splunk-6 0 3-DistSearch
No ratings yet
Splunk-6 0 3-DistSearch
49 pages
IO System
No ratings yet
IO System
32 pages
70-347.examcollection - Premium.exam.177q: 70-347 Enabling Office 365 Services Sections
No ratings yet
70-347.examcollection - Premium.exam.177q: 70-347 Enabling Office 365 Services Sections
117 pages
FSBUS Stepping Motor Controller
No ratings yet
FSBUS Stepping Motor Controller
16 pages
Lecs 102
No ratings yet
Lecs 102
20 pages
Trees Col 106: Acknowledgement:Many Slides Are Courtesy Douglas Harder, Uwaterloo
No ratings yet
Trees Col 106: Acknowledgement:Many Slides Are Courtesy Douglas Harder, Uwaterloo
59 pages
DC Power Monitor With INA209 and Arduino
100% (2)
DC Power Monitor With INA209 and Arduino
57 pages
1 s2.0 S0140366421003613 Main
No ratings yet
1 s2.0 S0140366421003613 Main
10 pages
Icons: Cisco Products
No ratings yet
Icons: Cisco Products
12 pages
Blue Pumpkin (Database+Source Code+Lib+Excel)
No ratings yet
Blue Pumpkin (Database+Source Code+Lib+Excel)
25 pages
Exp 06 Random Number Generator For Gaming Using D - Flipflop
No ratings yet
Exp 06 Random Number Generator For Gaming Using D - Flipflop
12 pages
Dell Premium DES-1423 by - VCEplus 68q-DEMO
100% (1)
Dell Premium DES-1423 by - VCEplus 68q-DEMO
22 pages
Oracle - Actualtests.1z0 448.v2018!11!26.by - Colin.49q
No ratings yet
Oracle - Actualtests.1z0 448.v2018!11!26.by - Colin.49q
22 pages
PDFsam Merge
No ratings yet
PDFsam Merge
25 pages
Introduction To Python
No ratings yet
Introduction To Python
5 pages
Conecting Hardware Peripherals LO 4 1
No ratings yet
Conecting Hardware Peripherals LO 4 1
9 pages
WebSphere V8.5 Configuration - DMZ
No ratings yet
WebSphere V8.5 Configuration - DMZ
21 pages
PraposalFinal MIC
No ratings yet
PraposalFinal MIC
4 pages
Compal Confidential: EA50 - HWS M/B Schematics Document Intel Shark Bay SV (Haswell+ Lynx Point)
No ratings yet
Compal Confidential: EA50 - HWS M/B Schematics Document Intel Shark Bay SV (Haswell+ Lynx Point)
56 pages
TY COMP Sem VI MAD 22617 MCQs Bank
No ratings yet
TY COMP Sem VI MAD 22617 MCQs Bank
20 pages

Python Assignment 3

Uploaded by

Python Assignment 3

Uploaded by

In [68]: #TASK 1

# input data string corpus = ['this is the first document',

vectorizer = TfidfVectorizer() vectorizer.fit(corpus) skl_output =

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer() vectors = vectorizer.fit_transform(corpus)

matrix = CountVectorizer() matrix.fit(corpus) # after this statement the

# to convert the sentance into numerical vectors, we will call transfor

#compute IDF using custom method

if(word in list[j]): #print(word) #print(list[j]) Y=Y+1 X=len(corpus)

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'th

is'] [1 91629073 1 22314355 1 51082562 1 1 91629073 1 91629073 Create PDF

in your applications with the Pdfcrowd HTML to PDF API PDFCROWD

#compute IDF IDF=vectorizer.idf_

#sort IDF in descending order sorted_IDF=np.sort(IDF)

#print top 50 IDF values

Number of documents in data = 746 [6.922918 6.922918 6.922918 6.922918

6.922918 6.922918 6.922918 6.92291 8 6.922918 6.922918 6.922918 6.922918

6.922918 6.922918 6.922918 6.92291 8 6.922918 6.922918 6.922918 6.922918

6.922918 6.922918 6.922918 6.92291 8 6.922918 6.922918 6.922918 6.922918

6.922918 6.922918 6.922918 6.92291 8 6.922918]

You might also like