Pract 1 Measuring The Document Similarity in Python

Uploaded by

tryhackkme123

Title: Write a program to Compute Similarity between two text documents.

Objectives: To study Document similarity using Cosine and TF-IDF Cosine method

Problem Statement: Compute Similarity between two text documents

Software required: Open-source software such as Jupyter or Spyder for Python programming

Theory:

Document similarity, as the name suggests, measures how similar two given documents
are. By “documents”, we mean a collection of strings, for example an essay or a .txt
file. Many organizations use document similarity to check for plagiarism. It is also
used by many exam-conducting institutions to check whether one student copied from another.
Therefore, it is both important and interesting to know how all of this works.

Document similarity is calculated by computing the document distance. Document distance is a
concept where documents are treated as vectors, and the distance is the angle between
two given document vectors. A document vector holds the number of occurrences of each word in a
given document. Let’s see an example:
Say that we are given two documents D1 and D2:
D1: “This is a program”
D2: “This was a program thing”
The words common to both documents are:
"This", "a", "program"
[Figure: D1, D2 and their common words drawn as vectors on three axes]
To take the dot product of D1 and D2, we write each document as a vector of word counts
over the combined vocabulary (this, is, was, a, program, thing):

D1 = (1, 1, 0, 1, 1, 0)
D2 = (1, 0, 1, 1, 1, 1)

D1.D2 = 1*1 + 1*0 + 0*1 + 1*1 + 1*1 + 0*1
D1.D2 = 1+0+0+1+1+0
D1.D2 = 3
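
As a rough illustration, the dot product above can be reproduced in a few lines of Python. This is only a sketch: the helper name `dot_product` and the use of `collections.Counter` are choices made for this example, and the two strings are the sample documents D1 and D2 in lower case.

```python
from collections import Counter


def dot_product(freq1, freq2):
    # Sum the products of the counts of words that occur in both documents.
    return sum(count * freq2[word]
               for word, count in freq1.items() if word in freq2)


# Word-count vectors for the two sample documents.
d1 = Counter("this is a program".split())
d2 = Counter("this was a program thing".split())

result = dot_product(d1, d2)
print(result)  # 3
```

Only the three shared words ("this", "a", "program") contribute, which is why the result equals 3.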

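Now that we know how to calculate the dot product of these documents, we can calculate
the angle between the document vectors:

cos d = D1.D2 / (|D1| |D2|)

Here d is the document distance. Because word counts are never negative, its value ranges
from 0 degrees to 90 degrees: 0 degrees means the two documents use the same words in the
same proportions, and 90 degrees means the two documents share no words at all.
For the example above, |D1| = sqrt(4) = 2 and |D2| = sqrt(5), so
cos d = 3 / (2 * sqrt(5)) ≈ 0.671 and d ≈ 0.835 radians (about 47.9 degrees).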
The good thing about cosine similarity is that it computes the orientation between vectors and
not the magnitude. Thus it will capture similarity between two documents that are similar despite
being different in size.
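
The following sketch computes the angle for the two sample documents and then checks the claim that document size does not matter. The function name `angle_between` and the "long" document (D2 repeated three times) are assumptions introduced only for this illustration.

```python
import math
from collections import Counter


def angle_between(freq1, freq2):
    # Dot product over the words the two documents share.
    dot = sum(c * freq2[w] for w, c in freq1.items() if w in freq2)
    norm1 = math.sqrt(sum(c * c for c in freq1.values()))
    norm2 = math.sqrt(sum(c * c for c in freq2.values()))
    # Clamp to 1.0 to guard against rounding slightly above 1.0.
    return math.acos(min(1.0, dot / (norm1 * norm2)))


d1 = Counter("this is a program".split())
d2 = Counter("this was a program thing".split())
# Hypothetical longer document: D2 repeated three times.
d2_long = Counter(("this was a program thing " * 3).split())

angle = angle_between(d1, d2)
angle_long = angle_between(d1, d2_long)
print(round(angle, 6))       # 0.835482
print(round(angle_long, 6))  # same value: only the direction matters
```

Tripling every count in D2 scales the vector without changing its direction, so the angle to D1 is unchanged.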

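We can also compute cosine similarity with Term Frequency-Inverse Document Frequency (TF-IDF) weights.
We first compute the term frequency, commonly defined as:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)

We then compute the inverse document frequency, commonly defined as:

IDF(t) = log(N / n_t), where N is the number of documents and n_t is the number of documents that contain t

Finally, we compute TF-IDF by multiplying TF by IDF. We then apply cosine similarity to the
document vectors, using the TF-IDF value of each term as its weight in the vector.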
Multiplying the term frequency with the inverse document frequency helps offset some words
which appear more frequently in general across documents and focus on words which are
different between documents. This technique helps in finding documents that match a search
query by focussing the search on important keywords.
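
A minimal pure-Python sketch of TF-IDF cosine similarity is given below. Several things in it are assumptions rather than part of this practical: TF is taken as count divided by document length, IDF as log(N / document frequency), the function names `tf_idf_vectors` and `cosine_similarity` are made up for the example, and a hypothetical third document is added to the corpus. The third document is needed because, with only two documents, this IDF gives a weight of zero to every word they share.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        # Assumed weighting: (count / length) * log(N / document frequency).
        vectors.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def cosine_similarity(v1, v2):
    dot = sum(weight * v2.get(word, 0.0) for word, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # a vector with all-zero weights has no direction
    return dot / (norm1 * norm2)


corpus = [
    "This is a program",
    "This was a program thing",
    "The weather is nice today",  # hypothetical third document
]
vectors = tf_idf_vectors(corpus)
similarity = cosine_similarity(vectors[0], vectors[1])
print(round(similarity, 3))
```

With this corpus, the TF-IDF similarity of D1 and D2 comes out lower than the raw-count cosine of about 0.671, because "was" and "thing", which occur only in D2, receive larger weights than the shared words.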
Document similarity program:
Our algorithm to compute document similarity will consist of three fundamental steps:
• Split the documents into words.
• Compute the word frequencies.
• Calculate the dot product of the document vectors.
For the first step, we open each file and use the .read() method to read its content.
We then convert the text to lower case, replace punctuation with spaces and split it into a
list of words. Next, we calculate the word frequencies for the file: the occurrences of each
word are counted and stored in a dictionary that maps the word to its count.

Code:

import math
import string
import sys


# Reads the text file.
# This function returns the whole
# content of the file as one string.
def read_file(filename):
    try:
        with open(filename, 'r') as f:
            data = f.read()
        return data
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()


# Splitting the text into words.
# The translation table is a global variable
# mapping upper case to lower case and
# punctuation to spaces.
translation_table = str.maketrans(
    string.punctuation + string.ascii_uppercase,
    " " * len(string.punctuation) + string.ascii_lowercase)


# Returns a list of the words
# in the text.
def get_words_from_line_list(text):
    text = text.translate(translation_table)
    word_list = text.split()
    return word_list


# Counts the frequency of each word.
# Returns a dictionary which maps
# the words to their frequency.
def count_frequency(word_list):
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word] + 1
        else:
            D[new_word] = 1
    return D


# Returns the dictionary of (word, frequency)
# pairs for the given file.
def word_frequencies_for_file(filename):
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)

    print("File", filename, ":")
    # read_file() returns a single string, so len(line_list)
    # is a character count, although the label says "lines".
    print(len(line_list), "lines, ")
    print(len(word_list), "words, ")
    print(len(freq_mapping), "distinct words")

    return freq_mapping


# Returns the dot product of two documents.
def dotProduct(D1, D2):
    Sum = 0.0
    for key in D1:
        if key in D2:
            Sum += (D1[key] * D2[key])
    return Sum


# Returns the angle in radians
# between document vectors.
def vector_angle(D1, D2):
    numerator = dotProduct(D1, D2)
    denominator = math.sqrt(dotProduct(D1, D1) * dotProduct(D2, D2))
    # Clamp to 1.0 so that rounding error cannot push the
    # ratio outside the domain of math.acos.
    return math.acos(min(1.0, numerator / denominator))


def documentSimilarity(filename_1, filename_2):
    # filename_1 = sys.argv[1]
    # filename_2 = sys.argv[2]
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1, sorted_word_list_2)

    print("The distance between the documents is: %0.6f (radians)" % distance)


# Driver code
documentSimilarity('GFG.txt', 'file.txt')

Output:
File GFG.txt :
15 lines,
4 words,
4 distinct words
File file.txt :
22 lines,
5 words,
5 distinct words
The distance between the documents is: 0.835482 (radians)
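
The reported angle is in radians. As a small sketch, it can be converted to degrees or back to a cosine similarity score with the standard `math` module; the variable names here are chosen only for this illustration.

```python
import math

distance = 0.835482  # angle in radians, taken from the output above

degrees = math.degrees(distance)
similarity = math.cos(distance)

print("%.2f degrees" % degrees)                # 47.87 degrees
print("cosine similarity: %.4f" % similarity)  # cosine similarity: 0.6708
```

A cosine similarity of 1 corresponds to an angle of 0 (same word proportions), and 0 corresponds to 90 degrees (no words in common).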
