Pip is Python's package installer, used for installing, managing, and
uninstalling software packages and their dependencies. It simplifies the
process of adding external libraries to your Python environment, making it
an essential tool for Python development.
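For instance, a few common pip invocations in a notebook cell look like this (requests is only a placeholder package name; the leading "!" runs the command in a shell):
# install a package
!pip install requests
# upgrade an already-installed package to its latest version
!pip install -U requests
# remove a package without a confirmation prompt
!pip uninstall -y requests
# list the packages installed in the current environment
!pip list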
!python -m spacy download en_core_web_sm
This command downloads en_core_web_sm, a small English pipeline trained on written web text (blogs, news, comments) that includes vocabulary, syntax, and entities.
1. To implement text preprocessing techniques such as tokenization, case folding, stemming, and lemmatization, and to calculate the edit distance between text strings.
Sol:
TOKENIZATION
## required libraries that need to be installed
%%capture
!pip install -U spacy
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm
(RUN)
Explanation:
1. pip install -U spacy: This command installs or upgrades the spaCy
library to its latest version. The -U flag ensures that it updates the
package if it's already installed.
2. pip install -U spacy-lookups-data: This installs or upgrades the
spaCy lookups data, which provides lookup tables for various
features like stop words or lemmatization, and is often required for
some spaCy functionalities, especially those depending on statistical
models.
3. python -m spacy download en_core_web_sm: This downloads
the en_core_web_sm model, which is a small English pipeline that
supports core spaCy capabilities and is trained on web text. It
includes the language data and binary weights for predictions like
part-of-speech tags, dependencies, and named entities.
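A quick, optional check that the installation worked (a minimal sketch; the exact version printed depends on your environment):
import spacy
# Show the installed spaCy version
print(spacy.__version__)
# Loading the model raises an OSError if the download step was skipped
nlp = spacy.load("en_core_web_sm")
# List the components in the loaded pipeline (e.g. tagger, parser, ner)
print(nlp.pipe_names)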
## tokenizing a piece of text
doc = "I love coding and writing"
for i, w in enumerate(doc.split(" ")):
    print("Token " + str(i) + ": " + w)
Explanation:
The given code snippet effectively demonstrates a simple form
of string tokenization in Python.
Here's a breakdown of the code and its functionality:
doc = "I love coding and writing": The string variable doc is
initialized with the text "I love coding and writing".
doc.split(" "): This method splits the string into a list of
substrings (or tokens) wherever a space (" ") is
encountered. The result will be ['I', 'love', 'coding', 'and',
'writing'].
enumerate(doc.split(" ")): The enumerate() function iterates
over the list returned by doc.split(" "). It provides the index
(i) and the value (w) for each item in the list.
for i, w in enumerate(doc.split(" ")): This sets up a loop that
iterates through the enumerated list of tokens. In each
iteration, i holds the index (starting from 0) and w holds the
actual token (word).
print("Token " + str(i) + ": " + w): Inside the loop, this line
prints the token's index and the token itself in the format
"Token [index]: [token]".
O/P: Token 0: I
Token 1: love
Token 2: coding
Token 3: and
Token 4: writing
Exercise 1: Copy the code from above, add extra whitespace to the string value assigned to the doc variable, and identify the issue with the code. Then try to fix the issue.
Hint: Use doc.strip() to fix the problem.
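A minimal sketch of the issue and two possible fixes (strip() removes leading and trailing whitespace, while split() with no argument also collapses repeated internal spaces):
doc = "  I love  coding and writing  "
# split(" ") yields empty-string tokens wherever extra spaces occur
print(doc.split(" "))
# Fix 1: strip() removes leading/trailing whitespace before splitting
print(doc.strip().split(" "))
# Fix 2: split() with no argument splits on any run of whitespace
print(doc.split())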
LEMMATIZATION
import spacy
# Load the spaCy English model
nlp = spacy.load('en_core_web_sm')
# Define a sample text
text = "The quick brown foxes are jumping over the lazy dogs."
# Process the text using spaCy
doc = nlp(text)
# Extract lemmatized tokens
lemmatized_tokens = [token.lemma_ for token in doc]
# Join the lemmatized tokens into a sentence
lemmatized_text = ' '.join(lemmatized_tokens)
# Print the original and lemmatized text
print("Original Text:", text)
print("Lemmatized Text:", lemmatized_text)
Explanation:
import spacy: This line imports the necessary spaCy library.
nlp = spacy.load('en_core_web_sm'): This line loads the pre-trained
English language model named "en_core_web_sm". This model
includes components like a tokenizer, part-of-speech (POS) tagger,
and lemmatizer, which are essential for tasks like lemmatization.
The "sm" indicates it is a small model.
text = "The quick brown foxes are jumping over the lazy dogs.": This
line defines the input string that will be processed.
doc = nlp(text): This line processes the input text using the loaded
spaCy model, creating a Doc object. The Doc object represents the
processed text and contains information about tokens, their
linguistic features, and relationships. When nlp is called on the
text, it first tokenizes the text to produce a Doc object, then runs
it through several pipeline steps, including POS tagging and
lemmatization.
lemmatized_tokens = [token.lemma_ for token in doc]: This line
iterates over each token in the Doc object and extracts its lemma
using the token.lemma_ attribute. The token.lemma_ attribute
returns the base or dictionary form of a word, also known as the
lemma. For example, the lemma of "foxes" is "fox", and the lemma
of "jumping" is "jump".
lemmatized_text = ' '.join(lemmatized_tokens): This line joins the list
of lemmatized tokens back into a single string, separated by spaces.
print("Original Text:", text) and print("Lemmatized Text:",
lemmatized_text): These lines print the original text and the
lemmatized text for comparison.
O/P: Original Text: The quick brown foxes are jumping over the
lazy dogs.
Lemmatized Text: the quick brown fox be jump over the
lazy dog .
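To see which token maps to which lemma, the same Doc object can also be printed token by token (a small sketch reusing the doc variable from the code above):
# Print each token next to its lemma and coarse part-of-speech tag
for token in doc:
    print(token.text, "->", token.lemma_, "(" + token.pos_ + ")")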
Word lemmatization with NLTK:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
# Create WordNetLemmatizer object
wnl = WordNetLemmatizer()
# single word lemmatization examples
list1 = ['kites', 'babies', 'dogs', 'flying', 'smiling', 'driving', 'died', 'tried', 'feet']
for words in list1:
    print(words + " ---> " + wnl.lemmatize(words))
output: kites ---> kite
babies ---> baby
dogs ---> dog
flying ---> flying
smiling ---> smiling
driving ---> driving
died ---> died
tried ---> tried
feet ---> foot
Explanation:
1. import nltk and nltk.download('wordnet'): These lines import
the Natural Language Toolkit (NLTK) and download the
WordNet corpus, a lexical database essential for
the WordNetLemmatizer to function correctly.
2. from nltk.stem import WordNetLemmatizer: This imports
the WordNetLemmatizer class from the NLTK stem module.
3. wnl = WordNetLemmatizer(): An instance of
the WordNetLemmatizer is created, which will be used to
perform lemmatization.
4. list1 = [...]: This list contains words in their inflected forms
(e.g., plurals, verbs in different tenses, etc.).
5. for words in list1: print(words + " ---> " +
wnl.lemmatize(words)):
This loop iterates through each word in list1.
wnl.lemmatize(words): This is the core function call
that performs the lemmatization. It takes a word as
input and returns its lemma or base form.
print(...): This line prints the original word alongside its lemmatized form.
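Note that wnl.lemmatize() treats every word as a noun unless a part-of-speech tag is supplied, which is why verb forms such as "flying", "died", and "tried" come back unchanged in the output above. Passing pos='v' makes the lemmatizer handle them as verbs (a short sketch reusing the wnl object from above):
# Lemmatize verb forms explicitly by passing pos='v'
for word in ['flying', 'smiling', 'driving', 'died', 'tried']:
    print(word + " ---> " + wnl.lemmatize(word, pos='v'))
With pos='v', "flying" becomes "fly", "died" becomes "die", and "tried" becomes "try".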
CASE FOLDING
Sol:
import spacy
# Load language model
nlp = spacy.load("en_core_web_sm")
(Run)
import spacy: This line imports the spaCy library, making its
functionalities available for use in the current Python script. spaCy is
a free, open-source library for advanced Natural Language
Processing (NLP) in Python.
nlp = spacy.load("en_core_web_sm"): This line loads a pre-trained
English language model named "en_core_web_sm" into a spaCy
Language object, which is conventionally named nlp.
t = "The train to London leaves at 10am on Tuesday."
doc = nlp(t)
# Case fold
print([t.lower_ for t in doc]) (Run)
The spaCy token.lower_ attribute returns the lowercase
form of the token's text. In the given example, t = "The
train to London leaves at 10am on Tuesday." and doc =
nlp(t), iterating through the doc object and
printing t.lower_ for each token t will produce the
following output:
OUTPUT: ['the', 'train', 'to', 'london', 'leaves', 'at', '10', 'am', 'on',
'tuesday', '.']
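Case folding can also be done on the raw string before any NLP processing; the token-level lower_ attribute is mainly convenient when the rest of the pipeline output (lemmas, POS tags) is needed as well. A minimal sketch contrasting the two, reusing t and doc from above:
# Case fold the raw string directly
print(t.lower())
# Token-level view: each Token keeps both the original and lowercased text
print([(tok.text, tok.lower_) for tok in doc])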
STEMMING
Sol:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language='english')
doc = 'I prefer not to argue'
for token in doc.split(" "):
    print(token, '=>', stemmer.stem(token))
Explanation:
This code snippet uses the NLTK library's Snowball Stemmer to reduce words to their base or root forms. The Snowball Stemmer, also known as the Porter2 Stemmer, is an improvement upon the original Porter Stemmer and supports multiple languages.
The provided code executes as follows:
1. Import Snowball Stemmer: The from nltk.stem.snowball import
SnowballStemmer imports the class from the NLTK library.
2. Initialize Stemmer: The code creates an instance of the Snowball
Stemmer for English using stemmer =
SnowballStemmer(language='english').
3. Define Document: The string of text to be stemmed is defined
as doc = 'I prefer not to argue'.
4. Iterate and Stem:
The code splits the doc string into individual words (tokens)
using doc.split(" ").
It iterates through each token and applies
the stemmer.stem(token) method to reduce each word to its
root form.
The original word and its stemmed version are then printed.
Output:
I => i
prefer => prefer
not => not
to => to
argue => argu
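For comparison, a stemmer can produce non-words such as "argu", while a lemmatizer returns dictionary forms. A small sketch contrasting the two on the same tokens (it reuses doc and stemmer from above and assumes the WordNet data downloaded earlier in this exercise):
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
# Print the stem and the verb lemma side by side for each whitespace token
for token in doc.split(" "):
    print(token, '=>', stemmer.stem(token), '|', wnl.lemmatize(token, pos='v'))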
EDIT DISTANCE
Calculate the edit distance between text strings.
Sol:
import nltk
string1 = "CAT"
string2 = "RAT"
# Compute the Levenshtein (edit) distance between the two strings
distance = nltk.edit_distance(string1, string2)
print(f"The Levenshtein distance between '{string1}' and '{string2}' is: {distance}")
Explanation:
Here's how it works with this input:
string1 = "CAT"
string2 = "RAT"
To transform "CAT" into "RAT", substituting 'C' with 'R' is
necessary. This is one edit operation. Therefore, the
Levenshtein distance is 1.
The nltk.edit_distance() function is part of the Natural
Language Toolkit (NLTK) library in Python, a suite of libraries
for natural language processing. This function calculates the
Levenshtein distance between two strings, as demonstrated in
the code snippet.
OUTPUT: The Levenshtein distance between 'CAT' and 'RAT' is: 1
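The breakdown that follows describes a short spaCy tokenization snippet; reconstructed from that description, the code looks roughly like this:
import spacy
# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")
# Process the input string into a Doc object
doc = nlp("This is the so-called 'lemmatization' ")
# Print the raw text of each token
for token in doc:
    print(token.text)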
The Python code snippet utilizes the spaCy library, a popular tool
for Natural Language Processing (NLP), to perform tokenization.
Here's a breakdown of the code:
import spacy: This line imports the necessary spaCy library.
nlp = spacy.load("en_core_web_sm"): This line loads the English
language model, "en_core_web_sm", which is a small-sized model
used for tasks like tokenization, part-of-speech tagging, and
lemmatization. The nlp variable becomes a callable object that
processes text.
doc = nlp("This is the so-called 'lemmatization' "): The input string,
"This is the so-called 'lemmatization'", is passed to the nlp object.
This converts the string into a Doc object, which is a processed
representation of the text. During this process, spaCy implicitly
generates token objects representing individual words or
punctuation marks.
for token in doc: print(token.text): This loop iterates through each
token in the doc object and prints the raw text of each token using
the token.text attribute.
Output: