IR Practical Code
Finally, we compute tf-idf by multiplying TF and IDF, and we then use cosine similarity on the
document vectors with the tf-idf values as the vector weights.
Multiplying the term frequency by the inverse document frequency helps offset words that
appear frequently across documents in general and emphasizes words that differ between
documents. This technique helps in finding documents that match a search query by focusing
the search on important keywords.
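As a rough illustration of this weighting (the toy corpus, the tokenization and the use of the natural log below are choices made only for this sketch, not part of the practical), tf-idf vectors can be built and compared with cosine similarity as follows:

import math

# toy corpus used only for this illustration
docs = [
    "information retrieval ranks documents",
    "documents are ranked by tf idf weights",
    "cosine similarity compares weighted vectors",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
N = len(tokenized)

def tf_idf_vector(doc):
    # TF = raw count of the term in the document; IDF = log(N / document frequency)
    vec = []
    for term in vocab:
        tf = doc.count(term)
        df = sum(1 for d in tokenized if term in d)
        vec.append(tf * math.log(N / df))
    return vec

def cosine_similarity(u, v):
    # cosine of the angle between two weighted vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

v0, v1 = tf_idf_vector(tokenized[0]), tf_idf_vector(tokenized[1])
print("cosine similarity of doc 0 and doc 1:", round(cosine_similarity(v0, v1), 4))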
Document similarity program:
Our algorithm to compute document similarity will consist of three fundamental steps:
Split the documents into words.
Compute the word frequencies.
Calculate the dot product of the document vectors.
For the first step, we open each file and use the .read() method to read its contents. As we read
the contents, we split them into a list of words. Next, we compute the word-frequency mapping
for the file that was read in: the occurrence of each word is counted and the list is sorted
alphabetically.
Code:
import math
import string
import sys

def read_file(filename):
    # read and return the full text of the given file
    try:
        with open(filename, 'r') as f:
            data = f.read()
        return data
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

def get_words_from_line_list(text):
    # lower-case the text, replace punctuation with spaces and split it into words
    translation_table = str.maketrans(string.punctuation + string.ascii_uppercase,
                                      " " * len(string.punctuation) + string.ascii_lowercase)
    text = text.translate(translation_table)
    word_list = text.split()
    return word_list

def count_frequency(word_list):
    # map each word to the number of times it occurs
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word] + 1
        else:
            D[new_word] = 1
    return D

def word_frequencies_for_file(filename):
    # compute the word-frequency dictionary for a file and report basic statistics
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    print("File", filename, ":")
    print(len(line_list.splitlines()), "lines,")
    print(len(word_list), "words,")
    print(len(freq_mapping), "distinct words")
    return freq_mapping

def inner_product(D1, D2):
    # dot product of two word-frequency vectors
    Sum = 0.0
    for key in D1:
        if key in D2:
            Sum += (D1[key] * D2[key])
    return Sum

def vector_angle(D1, D2):
    # angle (in radians) between the two frequency vectors
    numerator = inner_product(D1, D2)
    denominator = math.sqrt(inner_product(D1, D1) * inner_product(D2, D2))
    return math.acos(numerator / denominator)

def documentSimilarity(filename_1, filename_2):
    # filename_1 = sys.argv[1]
    # filename_2 = sys.argv[2]
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1, sorted_word_list_2)
    print("The distance between the documents is: %0.6f (radians)" % distance)
# Driver code
documentSimilarity('GFG.txt', 'file.txt')
Output:
File GFG.txt :
15 lines,
4 words,
4 distinct words
File file.txt :
22 lines,
5 words,
5 distinct words
The distance between the documents is: 0.835482 (radians)
Practical No.2
Title: Implement Page Rank Algorithm
Objectives: To study how the PageRank algorithm ranks web pages and the advantages it
provides over traditional ranking methods.
Problem Statement:
To implement the PageRank algorithm in Python using NumPy and SciPy.
Software Requirement: Open source software such as Jupyter or Spyder for Python
programming
Theory:
PageRank is an algorithm used by Google Search to rank websites in their search engine
results. PageRank is a link analysis algorithm and it assigns a numerical weighting to each
element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of
"measuring" its relative importance within the set. The algorithm may be applied to any
collection of entities with reciprocal quotations and references. The numerical weight that it
assigns to any given element E is referred to as the PageRank of E and denoted by PR(E).
Other factors like Author Rank can contribute to the importance of an entity. The PageRank
algorithm outputs a probability distribution used to represent the likelihood that a person
randomly clicking on links will arrive at any particular page. PageRank can be calculated for
collections of documents of any size. It is assumed in several research papers that the
distribution is evenly divided among all documents in the collection at the beginning of the
computational process. The PageRank computations require several passes, called
"iterations", through the collection to adjust approximate PageRank values to more closely
reflect the theoretical true value. A PageRank results from a mathematical algorithm based on
the web graph, created by all World Wide Web pages as nodes and hyperlinks as edges,
taking into consideration authority hubs. The rank value indicates the importance of a
particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is
defined recursively and depends on the number and PageRank metric of all pages that link to
it ("incoming links"). A page that is linked to by many pages with high PageRank receives a
high rank itself.
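Concretely, with damping factor d (the probability of following a link rather than teleporting) and N pages, each iteration updates PR(p_i) = (1 - d)/N + d * sum over pages p_j linking to p_i of PR(p_j) / L(p_j), where L(p_j) is the number of outgoing links of p_j. In the code below, the damping factor is the parameter s.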
Code:
import numpy as np
from scipy.sparse import csc_matrix

def pageRank(G, s=.85, maxerr=.001):
    """
    Computes the pagerank for each of the n states.
    Used in webpage ranking and text summarization using unweighted
    or weighted transitions respectively.

    Args
    ----------
    G: matrix representing state transitions.
       Gij can be a boolean or non-negative real number representing
       the transition weight from state i to j.

    Kwargs
    ----------
    s: probability of following a transition. 1-s is the probability
       of teleporting to another state. Defaults to 0.85.
    maxerr: if the sum of pagerank changes between iterations is
       below this, we have converged. Defaults to 0.001.
    """
    n = G.shape[0]

    # transform G into a Markov matrix M
    M = csc_matrix(G, dtype=float)
    rsums = np.array(M.sum(1))[:, 0]
    ri, ci = M.nonzero()
    M.data /= rsums[ri]

    # bool array of sink states
    sink = rsums == 0

    # compute pagerank r until we converge
    ro, r = np.zeros(n), np.ones(n)
    while np.sum(np.abs(r - ro)) > maxerr:
        ro = r.copy()
        # calculate each pagerank one at a time
        for i in range(0, n):
            # inlinks of state i
            Ii = np.array(M[:, i].todense())[:, 0]
            # account for sink states
            Si = sink / float(n)
            # account for teleportation to state i
            Ti = np.ones(n) / float(n)
            r[i] = ro.dot(Ii * s + Si * s + Ti * (1 - s))

    # return normalized pagerank
    return r / sum(r)

if __name__ == '__main__':
    # Example extracted from 'Introduction to Information Retrieval'
    G = np.array([[1, 0, 1, 0, 1, 0, 0],
                  [0, 1, 1, 0, 0, 0, 0],
                  [1, 0, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1],
                  [0, 0, 0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1, 0, 1]])
    print(pageRank(G, s=.86))
Output:
[0.07625602 0.03093333 0.10222139 0.20927014 0.34157375 0.03974417 0.20000121]
Conclusion:
The implementation of Page Rank algorithm was done successfully.
Disadvantages
However, there are also some disadvantages to stopword removal, including:
The possibility of losing important information by removing words that may be
significant in a specific context.
The subjectivity of choosing which words to include in the stopword list can affect the
results of any downstream tasks.
The need to maintain and update the stopword list as the language and domain evolve.
Relevant stop-word lists can be hard to find in some languages, and so may not scale
as more languages need to be processed.
Code:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
def remove_stop_words(string):
    # build the English stopword set and keep only the non-stopword tokens
    stop_words = set(stopwords.words('english'))
    words = string.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    new_string = ' '.join(filtered_words)
    return new_string
# Example usage
input_string = "This is an example sentence to remove stop words from."
result = remove_stop_words(input_string)
print("Original string:", input_string)
print("Modified string:", result)
Output:
Original string: This is an example sentence to remove stop words from.
Modified string: example sentence remove stop words from.
Step 1: Set Up the Environment
The first step in building a web crawler is to ensure you have Python installed on your
system. You can download it from python.org. Additionally, you'll need to install the required
libraries:
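For the simple crawler shown below, the libraries used are presumably requests and BeautifulSoup (the beautifulsoup4 package on pip), which can be installed with:

pip install requests beautifulsoup4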
The next step is to create a new Python file (e.g., simple_crawler.py) and import the
necessary libraries:
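import requests
from bs4 import BeautifulSoup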
Create a function that takes a URL as input, sends an HTTP request, and extracts relevant
information from the HTML content:
def simple_crawler(url):
    # fetch the page
    response = requests.get(url)
    if response.status_code == 200:
        # parse the HTML and extract the page title
        soup = BeautifulSoup(response.text, 'html.parser')
        title = soup.title.text
        print(f'Title: {title}')
    else:
        print(f'Error: Failed to fetch {url}')
Provide a sample URL and call the simple_crawler function to test the crawler:
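For example (https://example.com is only a placeholder; substitute the page you actually want to crawl):

if __name__ == '__main__':
    simple_crawler('https://example.com')  # placeholder URL for illustration

Then run the script from the terminal: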
python simple_crawler.py
The crawler will fetch the HTML content of the provided URL, parse it, and print the title. You
can expand the crawler by adding more functionality for extracting different types of data.
Web crawling with Scrapy opens the door to a powerful and flexible framework designed
specifically for efficient and scalable web scraping. Scrapy simplifies the complexities of
building web crawlers, offering a structured environment for crafting spiders that can navigate
websites, extract data, and store it in a systematic manner. Here’s a closer look at web crawling
with Scrapy:
Installation:
Before you start, make sure you have Scrapy installed. You can install it using:
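pip install scrapy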
Open a terminal and navigate to the directory where you want to create your Scrapy project. Run
the following command:
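scrapy startproject yourproject

Here yourproject is just a placeholder name; Scrapy generates the project skeleton, including a spiders folder, inside it.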
Inside the project directory, navigate to the spiders folder and create a Python file for your
spider. Define a spider class by subclassing scrapy.Spider and providing essential details like
name, allowed domains, and start URLs.
import scrapy

class YourSpider(scrapy.Spider):
    name = 'your_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']
Extracting Data:
Using Selectors:
Scrapy utilizes powerful selectors for extracting data from HTML. You can define selectors in
the spider’s parse method to capture specific elements.
title = response.css('title::text').get()
Following Links:
Scrapy simplifies the process of following links. Use the follow method to navigate to other
pages.
def parse(self, response):
    # extract data from the current page
    title = response.css('title::text').get()
    yield {'title': title}
    # follow each link found on the page and parse it with the same callback
    for href in response.css('a::attr(href)').getall():
        yield response.follow(href, callback=self.parse)
Execute your spider using the following command from the project directory:
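scrapy crawl your_spider

The name after crawl is the name attribute defined in the spider class.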
Scrapy will initiate the spider, follow links, and execute the parsing logic defined in the parse
method.
Web crawling with Scrapy offers a robust and extensible framework for handling complex
scraping tasks. Its modular architecture and built-in features make it a preferred choice for
developers engaging in sophisticated web data extraction projects.
Code:
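A minimal sketch assembling the pieces above might look like the following (the spider name, domain and start URL are the same placeholders used earlier; adapt them to the site being crawled):

import scrapy

class YourSpider(scrapy.Spider):
    name = 'your_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    def parse(self, response):
        # extract the page title
        yield {'title': response.css('title::text').get()}
        # follow links to other pages and parse them with the same callback
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)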
Output:
Conclusion:
Code:
First create a file named emp.xml, paste the code below into that file and save it.
<?xml version="1.0" encoding="UTF-8"?>
<employee>
<fname>Ritesh</fname>
<lname>Saurabh</lname>
<home>Thane</home>
<expertise name="SQl"/>
<expertise name="Python"/>
<expertise name="Testing"/>
<expertise name="Business"/>
</employee>
Then create a new file named emp.py, save the code below in that file, and run it.
import xml.dom.minidom

def main():
    # parse the XML file and print the document node and root element names
    doc = xml.dom.minidom.parse("emp.xml")
    print(doc.nodeName)
    print(doc.firstChild.tagName)

if __name__ == "__main__":
    main()
Now create a new file named emp1.py, paste the code below into that file and save it.
import xml.dom.minidom

def main():
    doc = xml.dom.minidom.parse("emp.xml")
    print(doc.nodeName)
    print(doc.firstChild.tagName)

    # list the existing expertise elements
    expertise = doc.getElementsByTagName("expertise")
    print("%d expertise:" % expertise.length)
    for skill in expertise:
        print(skill.getAttribute("name"))

    # add a new expertise element to the in-memory DOM and list them again
    newexpertise = doc.createElement("expertise")
    newexpertise.setAttribute("name", "BigData")
    doc.firstChild.appendChild(newexpertise)
    print(" ")
    expertise = doc.getElementsByTagName("expertise")
    print("%d expertise:" % expertise.length)
    for skill in expertise:
        print(skill.getAttribute("name"))

if __name__ == "__main__":
    main()
Now run the code.