Pract 1 Measuring The Document Similarity in Python
Objectives: To study document similarity using the Cosine and TF-IDF Cosine methods
Software required: Open-source software such as Jupyter or Spyder for Python programming
Theory:
Document similarity, as the name suggests, determines how similar two given documents
are. By "documents" we mean a collection of strings, for example an essay or a .txt
file. Many organizations use document similarity to check for plagiarism. It is
also used by many examination bodies to check whether one student copied from another.
It is therefore important, as well as interesting, to know how all of this works.
Consider two example documents, D1 = "This is a program thing" and D2 = "This was a
program". Representing each document as a vector of word counts, their dot product is:
D1.D2 = "This"."This" + "is"."was" + "a"."a" + "program"."program" + "thing".0
D1.D2 = 1 + 0 + 1 + 1 + 0
D1.D2 = 3
Now that we know how to calculate the dot product of these documents, we can now calculate
the angle between the document vectors:
cos d = D1.D2 / (|D1| |D2|)
Here d is the document distance, i.e. the angle between the two document vectors. Its
value ranges from 0 to 90 degrees (0 to pi/2 radians), where 0 degrees means the two
documents are exactly identical and 90 degrees means the two documents share no words
at all.
The good thing about cosine similarity is that it measures the orientation of the
vectors, not their magnitude. It therefore captures the similarity of two documents
that cover the same content even when they differ in length.
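This length-invariance is easy to demonstrate: repeating a document scales its vector's magnitude but not its direction, so the cosine similarity is unchanged. A minimal sketch (the sample sentences are made up for illustration):

```python
import math
from collections import Counter

def cos_sim(a, b):
    # Cosine similarity between two word-count dictionaries
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

short = Counter("data mining is fun".split())
# The same text repeated three times: triple the magnitude, same direction
long_ = Counter(("data mining is fun " * 3).split())
other = Counter("data mining is hard work".split())

print(cos_sim(short, other))
print(cos_sim(long_, other))  # identical: cosine ignores document length
```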
We can also compute cosine similarity with Term Frequency - Inverse Document Frequency
(TF-IDF) weights.
We first compute the term frequency:
TF(t, d) = (number of times term t occurs in document d) / (total number of terms in d)
We then compute the inverse document frequency:
IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the
number of documents that contain t.
Finally we compute tf-idf by multiplying TF * IDF. We then use cosine similarity on the
vectors with the tf-idf values as the vector weights.
Multiplying the term frequency by the inverse document frequency offsets words that
appear frequently across all documents and emphasizes words that differ between
documents. This technique helps in finding documents that match a search query by
focusing the search on important keywords.
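A minimal sketch of TF-IDF weighting followed by cosine similarity, using the formulas above. The three sample documents and the natural-log IDF are assumptions for illustration:

```python
import math
from collections import Counter

# Three made-up documents (assumption, for illustration only)
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs make good pets"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(doc):
    # tf-idf weights for one tokenized document (natural-log idf)
    weights = {}
    for term, count in Counter(doc).items():
        tf = count / len(doc)                        # term frequency
        df = sum(1 for d in tokenized if term in d)  # document frequency
        weights[term] = tf * math.log(N / df)        # tf * idf
    return weights

def cosine(w1, w2):
    # Cosine similarity between two tf-idf weight dictionaries
    dot = sum(w1[t] * w2.get(t, 0.0) for t in w1)
    n1 = math.sqrt(sum(v * v for v in w1.values()))
    n2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (n1 * n2)

w0, w1 = tf_idf(tokenized[0]), tf_idf(tokenized[1])
# "the" appears in two of the three documents, so it is down-weighted
print(w0["the"] < w0["cat"])  # True
print(cosine(w0, w1))
```

Note how the common word "the" gets a smaller weight than the rarer word "cat", which is exactly the offsetting effect described above.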
Document similarity program:
Our algorithm for computing document similarity consists of three fundamental steps:
Split the documents into words.
Compute the word frequencies.
Calculate the dot product of the document vectors.
For the first step, we use the .read() method to open and read the contents of the
files. As we read the contents, we split them into a list of words. Next, we compute
the word-frequency mapping of the file that was read in: the occurrences of each word
are counted and stored in a dictionary.
Code:
import math
import string
import sys

# Map punctuation to spaces and upper-case letters to lower-case
translation_table = str.maketrans(
    string.punctuation + string.ascii_uppercase,
    " " * len(string.punctuation) + string.ascii_lowercase)

def read_file(filename):
    try:
        with open(filename, 'r') as f:
            data = f.read()
        return data
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

def get_words_from_line_list(text):
    text = text.translate(translation_table)
    word_list = text.split()
    return word_list

def count_frequency(word_list):
    # Map each word to the number of times it occurs
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word] + 1
        else:
            D[new_word] = 1
    return D

def word_frequencies_for_file(filename):
    # Read a file, print its statistics and return its word frequencies
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    print("File", filename, ":")
    print(len(line_list.splitlines()), "lines,")
    print(len(word_list), "words,")
    print(len(freq_mapping), "distinct words")
    return freq_mapping

def dot_product(D1, D2):
    Sum = 0.0
    for key in D1:
        if key in D2:
            Sum += (D1[key] * D2[key])
    return Sum

def vector_angle(D1, D2):
    # Angle (in radians) between the two document vectors
    numerator = dot_product(D1, D2)
    denominator = math.sqrt(dot_product(D1, D1) * dot_product(D2, D2))
    return math.acos(numerator / denominator)

def documentSimilarity(filename_1, filename_2):
    # filename_1 = sys.argv[1]
    # filename_2 = sys.argv[2]
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1, sorted_word_list_2)
    print("The distance between the documents is: %0.6f (radians)" % distance)

# Driver code
documentSimilarity('GFG.txt', 'file.txt')
Output:
File GFG.txt :
15 lines,
4 words,
4 distinct words
File file.txt :
22 lines,
5 words,
5 distinct words
The distance between the documents is: 0.835482 (radians)