IR Practical Code
Finally, we compute tf-idf by multiplying TF and IDF, and we then use cosine similarity on the
document vectors with the tf-idf values as the vector weights.
Multiplying the term frequency by the inverse document frequency helps offset words that
appear frequently across documents in general and emphasizes words that differ between
documents. This technique helps in finding documents that match a search query by focusing
the search on important keywords.
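As a rough illustration of this weighting (the toy corpus, the tokenization and the use of the natural log below are choices made only for this sketch, not part of the practical), tf-idf vectors can be built and compared with cosine similarity as follows:

import math

# toy corpus used only for this illustration
docs = [
    "information retrieval ranks documents",
    "documents are ranked by tf idf weights",
    "cosine similarity compares weighted vectors",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
N = len(tokenized)

def tf_idf_vector(doc):
    # TF = raw count of the term in the document; IDF = log(N / document frequency)
    vec = []
    for term in vocab:
        tf = doc.count(term)
        df = sum(1 for d in tokenized if term in d)
        vec.append(tf * math.log(N / df))
    return vec

def cosine_similarity(u, v):
    # cosine of the angle between two weighted vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

v0, v1 = tf_idf_vector(tokenized[0]), tf_idf_vector(tokenized[1])
print("cosine similarity of doc 0 and doc 1:", round(cosine_similarity(v0, v1), 4))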
Document similarity program:
Our algorithm to compute document similarity will consist of three fundamental steps:
Split the documents into words.
Compute the word frequencies.
Calculate the dot product of the document vectors.
For the first step, we open each file and use the .read() method to read its contents. As we read
the contents, we split them into a list of words. Next, we compute the word-frequency mapping
for the file that was read in: the occurrence of each word is counted and the list is sorted
alphabetically.
Code:
import math
import string
import sys

def read_file(filename):
    # read and return the full text of the given file
    try:
        with open(filename, 'r') as f:
            data = f.read()
        return data
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

def get_words_from_line_list(text):
    # lower-case the text, replace punctuation with spaces and split it into words
    translation_table = str.maketrans(string.punctuation + string.ascii_uppercase,
                                      " " * len(string.punctuation) + string.ascii_lowercase)
    text = text.translate(translation_table)
    word_list = text.split()
    return word_list

def count_frequency(word_list):
    # map each word to the number of times it occurs
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word] + 1
        else:
            D[new_word] = 1
    return D

def word_frequencies_for_file(filename):
    # compute the word-frequency dictionary for a file and report basic statistics
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    print("File", filename, ":")
    print(len(line_list.splitlines()), "lines,")
    print(len(word_list), "words,")
    print(len(freq_mapping), "distinct words")
    return freq_mapping

def inner_product(D1, D2):
    # dot product of two word-frequency vectors
    Sum = 0.0
    for key in D1:
        if key in D2:
            Sum += (D1[key] * D2[key])
    return Sum

def vector_angle(D1, D2):
    # angle (in radians) between the two frequency vectors
    numerator = inner_product(D1, D2)
    denominator = math.sqrt(inner_product(D1, D1) * inner_product(D2, D2))
    return math.acos(numerator / denominator)

def documentSimilarity(filename_1, filename_2):
    # filename_1 = sys.argv[1]
    # filename_2 = sys.argv[2]
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1, sorted_word_list_2)
    print("The distance between the documents is: %0.6f (radians)" % distance)
# Driver code
documentSimilarity('GFG.txt', 'file.txt')
Output:
File GFG.txt :
15 lines,
4 words,
4 distinct words
File file.txt :
22 lines,
5 words,
5 distinct words
The distance between the documents is: 0.835482 (radians)
Practical No.2
Title: Implement Page Rank Algorithm
Objectives: To study how the PageRank algorithm ranks web pages and the advantages it
provides over traditional ranking methods.
Problem Statement:
To implement the PageRank algorithm in Python using NumPy and SciPy.
Software Requirement: Open source software such as Jupyter or Spyder for Python
programming
Theory:
PageRank is an algorithm used by Google Search to rank websites in their search engine
results. PageRank is a link analysis algorithm and it assigns a numerical weighting to each
element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of
"measuring" its relative importance within the set. The algorithm may be applied to any
collection of entities with reciprocal quotations and references. The numerical weight that it
assigns to any given element E is referred to as the PageRank of E and denoted by PR(E).
Other factors like Author Rank can contribute to the importance of an entity. The PageRank
algorithm outputs a probability distribution used to represent the likelihood that a person
randomly clicking on links will arrive at any particular page. PageRank can be calculated for
collections of documents of any size. It is assumed in several research papers that the
distribution is evenly divided among all documents in the collection at the beginning of the
computational process. The PageRank computations require several passes, called
"iterations", through the collection to adjust approximate PageRank values to more closely
reflect the theoretical true value. A PageRank results from a mathematical algorithm based on
the web graph, created by all World Wide Web pages as nodes and hyperlinks as edges,
taking into consideration authority hubs. The rank value indicates the importance of a
particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is
defined recursively and depends on the number and PageRank metric of all pages that link to
it ("incoming links"). A page that is linked to by many pages with high PageRank receives a
high rank itself.
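Concretely, with damping factor d (the probability of following a link rather than teleporting) and N pages, each iteration updates PR(p_i) = (1 - d)/N + d * sum over pages p_j linking to p_i of PR(p_j) / L(p_j), where L(p_j) is the number of outgoing links of p_j. In the code below, the damping factor is the parameter s.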
Code:
import numpy as np
from scipy.sparse import csc_matrix

def pageRank(G, s=.85, maxerr=.001):
    """
    Computes the pagerank for each of the n states.
    Used in webpage ranking and text summarization using unweighted
    or weighted transitions respectively.

    Args
    ----------
    G: matrix representing state transitions.
       Gij can be a boolean or non-negative real number representing
       the transition weight from state i to j.

    Kwargs
    ----------
    s: probability of following a transition. 1-s is the probability
       of teleporting to another state. Defaults to 0.85.
    maxerr: if the sum of pagerank changes between iterations is
       below this, we have converged. Defaults to 0.001.
    """
    n = G.shape[0]

    # transform G into a Markov matrix M
    M = csc_matrix(G, dtype=float)
    rsums = np.array(M.sum(1))[:, 0]
    ri, ci = M.nonzero()
    M.data /= rsums[ri]

    # bool array of sink states
    sink = rsums == 0

    # compute pagerank r until we converge
    ro, r = np.zeros(n), np.ones(n)
    while np.sum(np.abs(r - ro)) > maxerr:
        ro = r.copy()
        # calculate each pagerank one at a time
        for i in range(0, n):
            # inlinks of state i
            Ii = np.array(M[:, i].todense())[:, 0]
            # account for sink states
            Si = sink / float(n)
            # account for teleportation to state i
            Ti = np.ones(n) / float(n)
            r[i] = ro.dot(Ii * s + Si * s + Ti * (1 - s))

    # return normalized pagerank
    return r / sum(r)

if __name__ == '__main__':
    # Example extracted from 'Introduction to Information Retrieval'
    G = np.array([[1, 0, 1, 0, 1, 0, 0],
                  [0, 1, 1, 0, 0, 0, 0],
                  [1, 0, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1],
                  [0, 0, 0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1, 0, 1]])
    print(pageRank(G, s=.86))
Output:
[0.07625602 0.03093333 0.10222139 0.20927014 0.34157375 0.03974417 0.20000121]
Conclusion:
The implementation of Page Rank algorithm was done successfully.
Disadvantages
However, there are also some disadvantages to stopword removal, including:
The possibility of losing important information by removing words that may be
significant in a specific context.
The subjectivity of choosing which words to include in the stopword list can affect the
results of any downstream tasks.
The need to maintain and update the stopword list as the language and domain evolve.
Relevant stop-word lists can be hard to find in some languages, and so may not scale
as more languages need to be processed.
Code:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
def remove_stop_words(string):
    # build the English stopword set and keep only the non-stopword tokens
    stop_words = set(stopwords.words('english'))
    words = string.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    new_string = ' '.join(filtered_words)
    return new_string
# Example usage
input_string = "This is an example sentence to remove stop words from."
result = remove_stop_words(input_string)
print("Original string:", input_string)
print("Modified string:", result)
Output:
Original string: This is an example sentence to remove stop words from.
Modified string: example sentence remove stop words from.
Step 1: Set Up the Environment
The first step in building a web crawler is to ensure you have Python installed on your
system. You can download it from python.org. Additionally, you'll need to install the required
libraries:
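For the simple crawler shown below, the libraries used are presumably requests and BeautifulSoup (the beautifulsoup4 package on pip), which can be installed with:

pip install requests beautifulsoup4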
The next step is to create a new Python file (e.g., simple_crawler.py) and import the
necessary libraries:
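import requests
from bs4 import BeautifulSoup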
Create a function that takes a URL as input, sends an HTTP request, and extracts relevant
information from the HTML content:
def simple_crawler(url):
    # fetch the page
    response = requests.get(url)
    if response.status_code == 200:
        # parse the HTML and extract the page title
        soup = BeautifulSoup(response.text, 'html.parser')
        title = soup.title.text
        print(f'Title: {title}')
    else:
        print(f'Error: Failed to fetch {url}')
Provide a sample URL and call the simple_crawler function to test the crawler:
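For example (https://example.com is only a placeholder; substitute the page you actually want to crawl):

if __name__ == '__main__':
    simple_crawler('https://example.com')  # placeholder URL for illustration

Then run the script from the terminal: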
python simple_crawler.py
The crawler will fetch the HTML content of the provided URL, parse it, and print the title. You
can expand the crawler by adding more functionality for extracting different types of data.
Web crawling with Scrapy opens the door to a powerful and flexible framework designed
specifically for efficient and scalable web scraping. Scrapy simplifies the complexities of
building web crawlers, offering a structured environment for crafting spiders that can navigate
websites, extract data, and store it in a systematic manner. Here’s a closer look at web crawling
with Scrapy:
Installation:
Before you start, make sure you have Scrapy installed. You can install it using:
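pip install scrapy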
Open a terminal and navigate to the directory where you want to create your Scrapy project. Run
the following command:
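scrapy startproject yourproject

Here yourproject is just a placeholder name; Scrapy generates the project skeleton, including a spiders folder, inside it.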
Inside the project directory, navigate to the spiders folder and create a Python file for your
spider. Define a spider class by subclassing scrapy.Spider and providing essential details like
name, allowed domains, and start URLs.
import scrapy

class YourSpider(scrapy.Spider):
    name = 'your_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']
Extracting Data:
Using Selectors:
Scrapy utilizes powerful selectors for extracting data from HTML. You can define selectors in
the spider’s parse method to capture specific elements.
title = response.css('title::text').get()
Following Links:
Scrapy simplifies the process of following links. Use the follow method to navigate to other
pages.
def parse(self, response):
    # extract data from the current page
    title = response.css('title::text').get()
    yield {'title': title}
    # follow each link found on the page and parse it with the same callback
    for href in response.css('a::attr(href)').getall():
        yield response.follow(href, callback=self.parse)
Execute your spider using the following command from the project directory:
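scrapy crawl your_spider

The name after crawl is the name attribute defined in the spider class.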
Scrapy will initiate the spider, follow links, and execute the parsing logic defined in the parse
method.
Web crawling with Scrapy offers a robust and extensible framework for handling complex
scraping tasks. Its modular architecture and built-in features make it a preferred choice for
developers engaging in sophisticated web data extraction projects.
Code:
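A minimal sketch assembling the pieces above might look like the following (the spider name, domain and start URL are the same placeholders used earlier; adapt them to the site being crawled):

import scrapy

class YourSpider(scrapy.Spider):
    name = 'your_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    def parse(self, response):
        # extract the page title
        yield {'title': response.css('title::text').get()}
        # follow links to other pages and parse them with the same callback
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)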
Output:
Conclusion:
Code:
First create a file named emp.xml, paste the code below into that file and save it.
<?xml version="1.0" encoding="UTF-8"?>
<employee>
<fname>Ritesh</fname>
<lname>Saurabh</lname>
<home>Thane</home>
<expertise name="SQl"/>
<expertise name="Python"/>
<expertise name="Testing"/>
<expertise name="Business"/>
</employee>
Then create a new file named emp.py, save the code below in that file, and run it.
import xml.dom.minidom

def main():
    # parse the XML file and print the document node and root element names
    doc = xml.dom.minidom.parse("emp.xml")
    print(doc.nodeName)
    print(doc.firstChild.tagName)

if __name__ == "__main__":
    main()
Now create a new file named emp1.py, paste the code below into that file and save it.
import xml.dom.minidom

def main():
    doc = xml.dom.minidom.parse("emp.xml")
    print(doc.nodeName)
    print(doc.firstChild.tagName)

    # list the existing expertise elements
    expertise = doc.getElementsByTagName("expertise")
    print("%d expertise:" % expertise.length)
    for skill in expertise:
        print(skill.getAttribute("name"))

    # add a new expertise element to the in-memory DOM and list them again
    newexpertise = doc.createElement("expertise")
    newexpertise.setAttribute("name", "BigData")
    doc.firstChild.appendChild(newexpertise)
    print(" ")
    expertise = doc.getElementsByTagName("expertise")
    print("%d expertise:" % expertise.length)
    for skill in expertise:
        print(skill.getAttribute("name"))

if __name__ == "__main__":
    main()
Now run the code.