
NLP Lab Programs

The document provides an overview of using Python's package installer, pip, to install the spaCy library and its English model for natural language processing tasks. It details various text preprocessing techniques, including tokenization, lemmatization, stemming, case folding, and calculating edit distance between strings, with code examples for each technique. The document serves as a guide for implementing these NLP techniques using spaCy and NLTK libraries.

Pip is Python's package installer, used for installing, managing, and uninstalling software packages and their dependencies. It simplifies the process of adding external libraries to your Python environment, making it an essential tool for Python development.
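For illustration, a few typical pip commands (a sketch; the package name requests is just an example, and inside a notebook cell each command is prefixed with "!" as in the commands below):

!pip install requests        # install a package
!pip install -U requests     # upgrade it to the latest version
!pip uninstall requests      # remove it
!pip list                    # show installed packages and their versions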

!python -m spacy download en_core_web_sm

en_core_web_sm is a small English pipeline trained on written web text (blogs, news, comments) that includes vocabulary, syntax and entities.

1. To implement text preprocessing techniques such as tokenization, case folding, stemming, lemmatization, and to calculate the edit distance between text strings.

Sol:

TOKENIZATION

## required libraries that need to be installed

%%capture

!pip install -U spacy

!pip install -U spacy-lookups-data

!python -m spacy download en_core_web_sm


(RUN)

Explanation:

1. pip install -U spacy: This command installs or upgrades the spaCy library to its latest version. The -U flag ensures that the package is updated if it is already installed.

2. pip install -U spacy-lookups-data: This installs or upgrades the spaCy lookups data, which provides lookup tables for features such as stop words and lemmatization, and is often required for some spaCy functionalities, especially those depending on statistical models.

3. python -m spacy download en_core_web_sm: This downloads the en_core_web_sm model, a small English pipeline trained on web text that supports core spaCy capabilities. It includes the language data and binary weights for predictions such as part-of-speech tags, dependencies, and named entities.
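As a quick sanity check (a sketch, not part of the original lab program), the downloaded pipeline can be loaded and its components listed:

import spacy

# Load the downloaded pipeline and print the names of its processing components.
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # typically includes a tagger, parser, lemmatizer and ner component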

## tokenizing a piece of text


doc = "I love coding and writing"

for i, w in enumerate(doc.split(" ")):
    print("Token " + str(i) + ": " + w)

Explanation:

The given code snippet demonstrates a simple form of string tokenization in Python.

Here's a breakdown of the code and its functionality:

• doc = "I love coding and writing": The string variable doc is initialized with the text "I love coding and writing".

• doc.split(" "): This method splits the string into a list of substrings (tokens) wherever a space (" ") is encountered. The result is ['I', 'love', 'coding', 'and', 'writing'].

• enumerate(doc.split(" ")): The enumerate() function iterates over the list returned by doc.split(" "), providing the index (i) and the value (w) for each item in the list.

• for i, w in enumerate(doc.split(" ")): This sets up a loop that iterates through the enumerated list of tokens. In each iteration, i holds the index (starting from 0) and w holds the actual token (word).

• print("Token " + str(i) + ": " + w): Inside the loop, this line prints the token's index and the token itself in the format "Token [index]: [token]".

O/P: Token 0: I

Token 1: love

Token 2: coding

Token 3: and

Token 4: writing

Exercise 1: Copy the code from above, add extra whitespace to the string value assigned to the doc variable, and identify the issue with the code. Then try to fix the issue. Hint: use strip() to fix the problem; one possible sketch follows.
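A possible solution sketch for Exercise 1, assuming the same split-based tokenizer (splitting on a single space yields empty-string tokens when extra whitespace is present; strip() removes the leading/trailing spaces, and split() with no argument also collapses repeated interior spaces):

doc = "  I love  coding and writing  "  # extra whitespace added on purpose

# Naive split on a single space produces empty-string "tokens" for the extra spaces.
print(doc.split(" "))

# Fix: strip leading/trailing whitespace, then split on any run of whitespace.
for i, w in enumerate(doc.strip().split()):
    print("Token " + str(i) + ": " + w)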
LEMMATIZATION

import spacy

# Load the spaCy English model

nlp = spacy.load('en_core_web_sm')

# Define a sample text

text = "The quick brown foxes are jumping over the lazy dogs."

# Process the text using spaCy

doc = nlp(text)

# Extract lemmatized tokens

lemmatized_tokens = [token.lemma_ for token in doc]

# Join the lemmatized tokens into a sentence

lemmatized_text = ' '.join(lemmatized_tokens)

# Print the original and lemmatized text

print("Original Text:", text)

print("Lemmatized Text:", lemmatized_text)

Explanation:

• import spacy: This line imports the necessary spaCy library.

• nlp = spacy.load('en_core_web_sm'): This line loads the pre-trained English language model named "en_core_web_sm". This model includes components like a tokenizer, part-of-speech (POS) tagger, and lemmatizer, which are essential for tasks like lemmatization. The "sm" indicates it is a small model.

• text = "The quick brown foxes are jumping over the lazy dogs.": This line defines the input string that will be processed.

• doc = nlp(text): This line processes the input text using the loaded spaCy model, creating a Doc object. The Doc object represents the processed text and contains information about tokens, their linguistic features, and relationships. When nlp is called on the text, it first tokenizes the text to produce a Doc object, then runs it through several pipeline steps, including POS tagging and lemmatization.

• lemmatized_tokens = [token.lemma_ for token in doc]: This line iterates over each token in the Doc object and extracts its lemma using the token.lemma_ attribute, which returns the base or dictionary form of a word. For example, the lemma of "foxes" is "fox", and the lemma of "jumping" is "jump".

• lemmatized_text = ' '.join(lemmatized_tokens): This line joins the list of lemmatized tokens back into a single string, separated by spaces.

• print("Original Text:", text) and print("Lemmatized Text:", lemmatized_text): These lines print the original text and the lemmatized text for comparison.

O/P: Original Text: The quick brown foxes are jumping over the lazy dogs.

Lemmatized Text: the quick brown fox be jump over the lazy dog .
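As an aside (a sketch, not part of the original program), the same Doc object carries other per-token annotations besides token.lemma_, such as the part-of-speech tag:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown foxes are jumping over the lazy dogs.")
# Print each token's surface form, lemma and coarse part-of-speech tag.
for token in doc:
    print(token.text, token.lemma_, token.pos_)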

Word Lemmatization (NLTK):

import nltk

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

# Create WordNetLemmatizer object

wnl = WordNetLemmatizer()

# single word lemmatization examples

list1 = ['kites', 'babies', 'dogs', 'flying', 'smiling', 'driving', 'died', 'tried', 'feet']

for words in list1:
    print(words + " ---> " + wnl.lemmatize(words))

output: kites ---> kite

babies ---> baby

dogs ---> dog


flying ---> flying

smiling ---> smiling

driving ---> driving

died ---> died

tried ---> tried

feet ---> foot

Explanation:

1. import nltk and nltk.download('wordnet'): These lines import the Natural Language Toolkit (NLTK) and download the WordNet corpus, a lexical database essential for the WordNetLemmatizer to function correctly.

2. from nltk.stem import WordNetLemmatizer: This imports the WordNetLemmatizer class from the NLTK stem module.

3. wnl = WordNetLemmatizer(): An instance of the WordNetLemmatizer is created, which will be used to perform lemmatization.

4. list1 = [...]: This list contains words in their inflected forms (e.g., plurals, verbs in different tenses, etc.).

5. for words in list1: print(words + " ---> " + wnl.lemmatize(words)):

• This loop iterates through each word in list1.

• wnl.lemmatize(words): This is the core function call that performs the lemmatization. It takes a word as input and returns its lemma or base form.

• print(...): This line prints the original word alongside its lemmatized form.
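A brief aside (a sketch, not part of the original program): wnl.lemmatize() defaults to treating every word as a noun, which is why verb forms such as "flying" and "died" come back unchanged above; passing pos='v' lemmatizes them as verbs.

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# The default part of speech is 'n' (noun), so verb inflections are left as-is.
print(wnl.lemmatize('flying'))           # flying
print(wnl.lemmatize('flying', pos='v'))  # fly
print(wnl.lemmatize('died', pos='v'))    # die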

CASE FOLDING

Sol:

import spacy

# Load language model

nlp = spacy.load("en_core_web_sm")
(Run)
• import spacy: This line imports the spaCy library, making its functionalities available for use in the current Python script. spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

• nlp = spacy.load("en_core_web_sm"): This line loads a pre-trained English language model named "en_core_web_sm" into a spaCy Language object, which is conventionally named nlp.

t = "The train to London leaves at 10am on Tuesday."

doc = nlp(t)

# Case fold

print([t.lower_ for t in doc])

(Run)

The spaCy token.lower_ attribute returns the lowercase form of the token's text. In the given example, t = "The train to London leaves at 10am on Tuesday." and doc = nlp(t), iterating through the doc object and printing t.lower_ for each token t produces the following output:

OUTPUT: ['the', 'train', 'to', 'london', 'leaves', 'at', '10', 'am', 'on', 'tuesday', '.']
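For comparison (a sketch, not in the original program), case folding can also be done at the plain-string level before any tokenization:

t = "The train to London leaves at 10am on Tuesday."
# str.lower() folds the whole string to lowercase in one step.
print(t.lower())  # the train to london leaves at 10am on tuesday.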

STEMMING

Sol:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

doc = 'I prefer not to argue'

for token in doc.split(" "):

    print(token, '=>', stemmer.stem(token))

Explanation:

The code snippet uses the NLTK library's Snowball Stemmer to reduce words to their base or root forms. The Snowball Stemmer, also known as the Porter2 Stemmer, is an improvement upon the original Porter Stemmer and supports multiple languages.

The provided code executes as follows:

1. Import Snowball Stemmer: from nltk.stem.snowball import SnowballStemmer imports the class from the NLTK library.

2. Initialize Stemmer: The code creates an instance of the Snowball Stemmer for English using stemmer = SnowballStemmer(language='english').

3. Define Document: The string of text to be stemmed is defined as doc = 'I prefer not to argue'.

4. Iterate and Stem:

• The code splits the doc string into individual words (tokens) using doc.split(" ").

• It iterates through each token and applies the stemmer.stem(token) method to reduce each word to its root form.

• The original word and its stemmed version are then printed.

Output:

I => i

prefer => prefer

not => not

to => to

argue => argu
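As a contrast with the lemmatization examples earlier (a sketch, not part of the original program), the same word list can be run through the Snowball stemmer; stems are produced by suffix-stripping rules and need not be real dictionary words (compare 'argue' => 'argu' above with the lemma 'argue'):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')
# Stem the word list used in the word-lemmatization example above.
for word in ['kites', 'babies', 'dogs', 'flying', 'smiling', 'driving', 'died', 'tried', 'feet']:
    print(word, '=>', stemmer.stem(word))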

EDIT DISTANCE

Sol:

import nltk

string1 = "CAT"

string2 = "RAT"

distance = nltk.edit_distance(string1, string2)

print(f"The Levenshtein distance between '{string1}' and '{string2}' is: {distance}")

Explanation:

Here's how it works with this input:

• string1 = "CAT"

• string2 = "RAT"

To transform "CAT" into "RAT", substituting 'C' with 'R' is necessary. This is one edit operation, so the Levenshtein distance is 1.

The nltk.edit_distance() function is part of the Natural Language Toolkit (NLTK) library in Python, a suite of libraries for natural language processing. This function calculates the Levenshtein distance between two strings, as demonstrated in the code snippet.

OUTPUT: The Levenshtein distance between 'CAT' and 'RAT' is: 1
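For reference, a minimal dynamic-programming sketch of the Levenshtein computation that nltk.edit_distance() performs, assuming unit costs for insertion, deletion, and substitution:

def levenshtein(a, b):
    # dp[i][j] holds the edit distance between a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # delete all i characters of a
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert all j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(a)][len(b)]

print(levenshtein("CAT", "RAT"))  # 1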

The Python code snippet utilizes the spaCy library, a popular tool for Natural Language Processing (NLP), to perform tokenization.
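The snippet itself does not appear above; a minimal sketch reconstructed from the breakdown that follows:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is the so-called 'lemmatization' ")
# Print the raw text of each token produced by the tokenizer.
for token in doc:
    print(token.text)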

Here's a breakdown of the code:

• import spacy: This line imports the necessary spaCy library.

• nlp = spacy.load("en_core_web_sm"): This line loads the English language model, "en_core_web_sm", a small-sized model used for tasks like tokenization, part-of-speech tagging, and lemmatization. The nlp variable becomes a callable object that processes text.

• doc = nlp("This is the so-called 'lemmatization' "): The input string, "This is the so-called 'lemmatization'", is passed to the nlp object. This converts the string into a Doc object, which is a processed representation of the text. During this process, spaCy implicitly generates token objects representing individual words or punctuation marks.

• for token in doc: print(token.text): This loop iterates through each token in the doc object and prints the raw text of each token using the token.text attribute.

Output
