Pract 1 Measuring The Document Similarity in Python
Objectives: To study document similarity using the Cosine and TF-IDF Cosine methods
Software required: Open-source software such as Jupyter or Spyder for Python programming
Theory:
Document similarity, as the name suggests, determines how similar two given documents
are. By "documents" we mean a collection of strings, for example an essay or a .txt
file. Many organizations use document similarity to check for plagiarism. It is
also used by many examination bodies to check whether one student copied from another.
It is therefore important, as well as interesting, to know how all of this works.
Consider two example documents, D1 = "This is a program thing" and D2 = "This was a
program". Representing each document as a vector of word counts, their dot product is:
D1.D2 = "This"."This" + "is"."was" + "a"."a" + "program"."program" + "thing".0
D1.D2 = 1 + 0 + 1 + 1 + 0
D1.D2 = 3
Now that we know how to calculate the dot product of these documents, we can now calculate
the angle between the document vectors:
cos d = D1.D2 / (|D1| |D2|)
Here d is the document distance, i.e. the angle between the two document vectors. Its
value ranges from 0 to 90 degrees (0 to pi/2 radians), where 0 degrees means the two
documents are exactly identical and 90 degrees means the two documents share no words
at all.
The good thing about cosine similarity is that it measures the orientation of the
vectors, not their magnitude. It therefore captures the similarity of two documents
that cover the same content even when they differ in length.
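This length-invariance is easy to demonstrate: repeating a document scales its vector's magnitude but not its direction, so the cosine similarity is unchanged. A minimal sketch (the sample sentences are made up for illustration):

```python
import math
from collections import Counter

def cos_sim(a, b):
    # Cosine similarity between two word-count dictionaries
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

short = Counter("data mining is fun".split())
# The same text repeated three times: triple the magnitude, same direction
long_ = Counter(("data mining is fun " * 3).split())
other = Counter("data mining is hard work".split())

print(cos_sim(short, other))
print(cos_sim(long_, other))  # identical: cosine ignores document length
```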
We can also compute cosine similarity with Term Frequency - Inverse Document Frequency
(TF-IDF) weights.
We first compute the term frequency:
TF(t, d) = (number of times term t occurs in document d) / (total number of terms in d)
We then compute the inverse document frequency:
IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the
number of documents that contain t.
Finally we compute tf-idf by multiplying TF * IDF. We then use cosine similarity on the
vectors with the tf-idf values as the vector weights.
Multiplying the term frequency by the inverse document frequency offsets words that
appear frequently across all documents and emphasizes words that differ between
documents. This technique helps in finding documents that match a search query by
focusing the search on important keywords.
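A minimal sketch of TF-IDF weighting followed by cosine similarity, using the formulas above. The three sample documents and the natural-log IDF are assumptions for illustration:

```python
import math
from collections import Counter

# Three made-up documents (assumption, for illustration only)
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs make good pets"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(doc):
    # tf-idf weights for one tokenized document (natural-log idf)
    weights = {}
    for term, count in Counter(doc).items():
        tf = count / len(doc)                        # term frequency
        df = sum(1 for d in tokenized if term in d)  # document frequency
        weights[term] = tf * math.log(N / df)        # tf * idf
    return weights

def cosine(w1, w2):
    # Cosine similarity between two tf-idf weight dictionaries
    dot = sum(w1[t] * w2.get(t, 0.0) for t in w1)
    n1 = math.sqrt(sum(v * v for v in w1.values()))
    n2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (n1 * n2)

w0, w1 = tf_idf(tokenized[0]), tf_idf(tokenized[1])
# "the" appears in two of the three documents, so it is down-weighted
print(w0["the"] < w0["cat"])  # True
print(cosine(w0, w1))
```

Note how the common word "the" gets a smaller weight than the rarer word "cat", which is exactly the offsetting effect described above.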
Document similarity program:
Our algorithm for computing document similarity consists of three fundamental steps:
Split the documents into words.
Compute the word frequencies.
Calculate the dot product of the document vectors.
For the first step, we use the .read() method to open and read the contents of the
files. As we read the contents, we split them into a list of words. Next, we compute
the word-frequency mapping of the file that was read in: the occurrences of each word
are counted and stored in a dictionary.
Code:
import math
import string
import sys

# Map punctuation to spaces and upper-case letters to lower-case
translation_table = str.maketrans(
    string.punctuation + string.ascii_uppercase,
    " " * len(string.punctuation) + string.ascii_lowercase)

def read_file(filename):
    try:
        with open(filename, 'r') as f:
            data = f.read()
        return data
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

def get_words_from_line_list(text):
    text = text.translate(translation_table)
    word_list = text.split()
    return word_list

def count_frequency(word_list):
    # Map each word to the number of times it occurs
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word] + 1
        else:
            D[new_word] = 1
    return D

def word_frequencies_for_file(filename):
    # Read a file, print its statistics and return its word frequencies
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    print("File", filename, ":")
    print(len(line_list.splitlines()), "lines,")
    print(len(word_list), "words,")
    print(len(freq_mapping), "distinct words")
    return freq_mapping

def dot_product(D1, D2):
    Sum = 0.0
    for key in D1:
        if key in D2:
            Sum += (D1[key] * D2[key])
    return Sum

def vector_angle(D1, D2):
    # Angle (in radians) between the two document vectors
    numerator = dot_product(D1, D2)
    denominator = math.sqrt(dot_product(D1, D1) * dot_product(D2, D2))
    return math.acos(numerator / denominator)

def documentSimilarity(filename_1, filename_2):
    # filename_1 = sys.argv[1]
    # filename_2 = sys.argv[2]
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1, sorted_word_list_2)
    print("The distance between the documents is: %0.6f (radians)" % distance)

# Driver code
documentSimilarity('GFG.txt', 'file.txt')
Output:
File GFG.txt :
15 lines,
4 words,
4 distinct words
File file.txt :
22 lines,
5 words,
5 distinct words
The distance between the documents is: 0.835482 (radians)