
NLP Lab Programs

The document provides an overview of using Python's package installer, pip, to install the spaCy library and its English model for natural language processing tasks. It details various text preprocessing techniques, including tokenization, lemmatization, stemming, case folding, and calculating edit distance between strings, with code examples for each technique. The document serves as a guide for implementing these NLP techniques using spaCy and NLTK libraries.

Pip is Python's package installer, used for installing, managing, and uninstalling software packages and their dependencies. It simplifies the process of adding external libraries to your Python environment, making it an essential tool for Python development.
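For illustration, a few typical pip commands (a sketch; the package name requests is just an example, and inside a notebook cell each command is prefixed with "!" as in the commands below):

!pip install requests        # install a package
!pip install -U requests     # upgrade it to the latest version
!pip uninstall requests      # remove it
!pip list                    # show installed packages and their versions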

!python -m spacy download en_core_web_sm

en_core_web_sm is a small English pipeline trained on written web text (blogs, news, comments) that includes vocabulary, syntax and entities.

1. To implement text preprocessing techniques such as tokenization, case folding, stemming, lemmatization, and to calculate the edit distance between text strings.

Sol:

TOKENIZATION

## required libraries that need to be installed

%%capture

!pip install -U spacy

!pip install -U spacy-lookups-data

!python -m spacy download en_core_web_sm


(RUN)

Explanation:

1. pip install -U spacy: This command installs or upgrades the spaCy library to its latest version. The -U flag ensures that the package is updated if it is already installed.

2. pip install -U spacy-lookups-data: This installs or upgrades the spaCy lookups data, which provides lookup tables for features such as stop words and lemmatization, and is often required for some spaCy functionalities, especially those depending on statistical models.

3. python -m spacy download en_core_web_sm: This downloads the en_core_web_sm model, a small English pipeline trained on web text that supports core spaCy capabilities. It includes the language data and binary weights for predictions such as part-of-speech tags, dependencies, and named entities.
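As a quick sanity check (a sketch, not part of the original lab program), the downloaded pipeline can be loaded and its components listed:

import spacy

# Load the downloaded pipeline and print the names of its processing components.
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # typically includes a tagger, parser, lemmatizer and ner component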

## tokenizing a piece of text


doc = "I love coding and writing"

for i, w in enumerate(doc.split(" ")):
    print("Token " + str(i) + ": " + w)

Explanation:

The given code snippet demonstrates a simple form of string tokenization in Python.

Here's a breakdown of the code and its functionality:

• doc = "I love coding and writing": The string variable doc is initialized with the text "I love coding and writing".

• doc.split(" "): This method splits the string into a list of substrings (tokens) wherever a space (" ") is encountered. The result is ['I', 'love', 'coding', 'and', 'writing'].

• enumerate(doc.split(" ")): The enumerate() function iterates over the list returned by doc.split(" "), providing the index (i) and the value (w) for each item in the list.

• for i, w in enumerate(doc.split(" ")): This sets up a loop that iterates through the enumerated list of tokens. In each iteration, i holds the index (starting from 0) and w holds the actual token (word).

• print("Token " + str(i) + ": " + w): Inside the loop, this line prints the token's index and the token itself in the format "Token [index]: [token]".

O/P: Token 0: I

Token 1: love

Token 2: coding

Token 3: and

Token 4: writing

Exercise 1: Copy the code from above, add extra whitespace to the string value assigned to the doc variable, and identify the issue with the code. Then try to fix the issue. Hint: use strip() to fix the problem; one possible sketch follows.
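A possible solution sketch for Exercise 1, assuming the same split-based tokenizer (splitting on a single space yields empty-string tokens when extra whitespace is present; strip() removes the leading/trailing spaces, and split() with no argument also collapses repeated interior spaces):

doc = "  I love  coding and writing  "  # extra whitespace added on purpose

# Naive split on a single space produces empty-string "tokens" for the extra spaces.
print(doc.split(" "))

# Fix: strip leading/trailing whitespace, then split on any run of whitespace.
for i, w in enumerate(doc.strip().split()):
    print("Token " + str(i) + ": " + w)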
LEMMATIZATION

import spacy

# Load the spaCy English model

nlp = spacy.load('en_core_web_sm')

# Define a sample text

text = "The quick brown foxes are jumping over the lazy dogs."

# Process the text using spaCy

doc = nlp(text)

# Extract lemmatized tokens

lemmatized_tokens = [token.lemma_ for token in doc]

# Join the lemmatized tokens into a sentence

lemmatized_text = ' '.join(lemmatized_tokens)

# Print the original and lemmatized text

print("Original Text:", text)

print("Lemmatized Text:", lemmatized_text)

Explanation:

• import spacy: This line imports the necessary spaCy library.

• nlp = spacy.load('en_core_web_sm'): This line loads the pre-trained English language model named "en_core_web_sm". This model includes components like a tokenizer, part-of-speech (POS) tagger, and lemmatizer, which are essential for tasks like lemmatization. The "sm" indicates it is a small model.

• text = "The quick brown foxes are jumping over the lazy dogs.": This line defines the input string that will be processed.

• doc = nlp(text): This line processes the input text using the loaded spaCy model, creating a Doc object. The Doc object represents the processed text and contains information about tokens, their linguistic features, and relationships. When nlp is called on the text, it first tokenizes the text to produce a Doc object, then runs it through several pipeline steps, including POS tagging and lemmatization.

• lemmatized_tokens = [token.lemma_ for token in doc]: This line iterates over each token in the Doc object and extracts its lemma using the token.lemma_ attribute, which returns the base or dictionary form of a word. For example, the lemma of "foxes" is "fox", and the lemma of "jumping" is "jump".

• lemmatized_text = ' '.join(lemmatized_tokens): This line joins the list of lemmatized tokens back into a single string, separated by spaces.

• print("Original Text:", text) and print("Lemmatized Text:", lemmatized_text): These lines print the original text and the lemmatized text for comparison.

O/P: Original Text: The quick brown foxes are jumping over the lazy dogs.

Lemmatized Text: the quick brown fox be jump over the lazy dog .
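As an aside (a sketch, not part of the original program), the same Doc object carries other per-token annotations besides token.lemma_, such as the part-of-speech tag:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown foxes are jumping over the lazy dogs.")
# Print each token's surface form, lemma and coarse part-of-speech tag.
for token in doc:
    print(token.text, token.lemma_, token.pos_)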

Word Lemmatization (NLTK):

import nltk

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

# Create WordNetLemmatizer object

wnl = WordNetLemmatizer()

# single word lemmatization examples

list1 = ['kites', 'babies', 'dogs', 'flying', 'smiling', 'driving', 'died', 'tried', 'feet']

for words in list1:
    print(words + " ---> " + wnl.lemmatize(words))

output: kites ---> kite

babies ---> baby

dogs ---> dog


flying ---> flying

smiling ---> smiling

driving ---> driving

died ---> died

tried ---> tried

feet ---> foot

Explanation:

1. import nltk and nltk.download('wordnet'): These lines import the Natural Language Toolkit (NLTK) and download the WordNet corpus, a lexical database essential for the WordNetLemmatizer to function correctly.

2. from nltk.stem import WordNetLemmatizer: This imports the WordNetLemmatizer class from the NLTK stem module.

3. wnl = WordNetLemmatizer(): An instance of the WordNetLemmatizer is created, which will be used to perform lemmatization.

4. list1 = [...]: This list contains words in their inflected forms (e.g., plurals, verbs in different tenses, etc.).

5. for words in list1: print(words + " ---> " + wnl.lemmatize(words)):

• This loop iterates through each word in list1.

• wnl.lemmatize(words): This is the core function call that performs the lemmatization. It takes a word as input and returns its lemma or base form.

• print(...): This line prints the original word alongside its lemmatized form.
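A brief aside (a sketch, not part of the original program): wnl.lemmatize() defaults to treating every word as a noun, which is why verb forms such as "flying" and "died" come back unchanged above; passing pos='v' lemmatizes them as verbs.

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# The default part of speech is 'n' (noun), so verb inflections are left as-is.
print(wnl.lemmatize('flying'))           # flying
print(wnl.lemmatize('flying', pos='v'))  # fly
print(wnl.lemmatize('died', pos='v'))    # die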

CASE FOLDING

Sol:

import spacy

# Load language model

nlp = spacy.load("en_core_web_sm")
(Run)
• import spacy: This line imports the spaCy library, making its functionalities available for use in the current Python script. spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

• nlp = spacy.load("en_core_web_sm"): This line loads a pre-trained English language model named "en_core_web_sm" into a spaCy Language object, which is conventionally named nlp.

t = "The train to London leaves at 10am on Tuesday."

doc = nlp(t)

# Case fold

print([t.lower_ for t in doc])

(Run)

The spaCy token.lower_ attribute returns the lowercase form of the token's text. In the given example, t = "The train to London leaves at 10am on Tuesday." and doc = nlp(t), iterating through the doc object and printing t.lower_ for each token t produces the following output:

OUTPUT: ['the', 'train', 'to', 'london', 'leaves', 'at', '10', 'am', 'on', 'tuesday', '.']
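For comparison (a sketch, not in the original program), case folding can also be done at the plain-string level before any tokenization:

t = "The train to London leaves at 10am on Tuesday."
# str.lower() folds the whole string to lowercase in one step.
print(t.lower())  # the train to london leaves at 10am on tuesday.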

STEMMING

Sol:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

doc = 'I prefer not to argue'

for token in doc.split(" "):

    print(token, '=>', stemmer.stem(token))

Explanation:

The code snippet uses the NLTK library's Snowball Stemmer to reduce words to their base or root forms. The Snowball Stemmer, also known as the Porter2 Stemmer, is an improvement upon the original Porter Stemmer and supports multiple languages.

The provided code executes as follows:

1. Import Snowball Stemmer: from nltk.stem.snowball import SnowballStemmer imports the class from the NLTK library.

2. Initialize Stemmer: The code creates an instance of the Snowball Stemmer for English using stemmer = SnowballStemmer(language='english').

3. Define Document: The string of text to be stemmed is defined as doc = 'I prefer not to argue'.

4. Iterate and Stem:

• The code splits the doc string into individual words (tokens) using doc.split(" ").

• It iterates through each token and applies the stemmer.stem(token) method to reduce each word to its root form.

• The original word and its stemmed version are then printed.

Output:

I => i

prefer => prefer

not => not

to => to

argue => argu
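As a contrast with the lemmatization examples earlier (a sketch, not part of the original program), the same word list can be run through the Snowball stemmer; stems are produced by suffix-stripping rules and need not be real dictionary words (compare 'argue' => 'argu' above with the lemma 'argue'):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')
# Stem the word list used in the word-lemmatization example above.
for word in ['kites', 'babies', 'dogs', 'flying', 'smiling', 'driving', 'died', 'tried', 'feet']:
    print(word, '=>', stemmer.stem(word))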

EDIT DISTANCE

Sol:

import nltk

string1 = "CAT"

string2 = "RAT"

distance = nltk.edit_distance(string1, string2)

print(f"The Levenshtein distance between '{string1}' and '{string2}' is: {distance}")

Explanation:

Here's how it works with this input:

• string1 = "CAT"

• string2 = "RAT"

To transform "CAT" into "RAT", substituting 'C' with 'R' is necessary. This is one edit operation, so the Levenshtein distance is 1.

The nltk.edit_distance() function is part of the Natural Language Toolkit (NLTK) library in Python, a suite of libraries for natural language processing. This function calculates the Levenshtein distance between two strings, as demonstrated in the code snippet.

OUTPUT: The Levenshtein distance between 'CAT' and 'RAT' is: 1
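For reference, a minimal dynamic-programming sketch of the Levenshtein computation that nltk.edit_distance() performs, assuming unit costs for insertion, deletion, and substitution:

def levenshtein(a, b):
    # dp[i][j] holds the edit distance between a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # delete all i characters of a
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert all j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(a)][len(b)]

print(levenshtein("CAT", "RAT"))  # 1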

The Python code snippet utilizes the spaCy library, a popular tool for Natural Language Processing (NLP), to perform tokenization.
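The snippet itself does not appear above; a minimal sketch reconstructed from the breakdown that follows:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is the so-called 'lemmatization' ")
# Print the raw text of each token produced by the tokenizer.
for token in doc:
    print(token.text)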

Here's a breakdown of the code:

• import spacy: This line imports the necessary spaCy library.

• nlp = spacy.load("en_core_web_sm"): This line loads the English language model, "en_core_web_sm", a small-sized model used for tasks like tokenization, part-of-speech tagging, and lemmatization. The nlp variable becomes a callable object that processes text.

• doc = nlp("This is the so-called 'lemmatization' "): The input string, "This is the so-called 'lemmatization'", is passed to the nlp object. This converts the string into a Doc object, which is a processed representation of the text. During this process, spaCy implicitly generates token objects representing individual words or punctuation marks.

• for token in doc: print(token.text): This loop iterates through each token in the doc object and prints the raw text of each token using the token.text attribute.

Output
