Text Preprocessing 13/01/25, 21:05
Text Preprocessing with spaCy and nltk
This lab program demonstrates how to preprocess text using two powerful libraries:
spaCy and nltk. Text preprocessing is a crucial step in Natural Language Processing
(NLP) pipelines. It involves cleaning and preparing text data for analysis or model
training.
Steps Covered
1. Tokenization
2. Lowercasing
3. Stopword Removal
4. Lemmatization
5. Stemming (nltk)
Let's get started!
Install and Import Libraries
In [ ]: !pip install spacy nltk -q
# Download spaCy model and nltk resources
!python -m spacy download en_core_web_sm -q
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
# Import libraries
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
Text Dataset
For this lab, we'll use a small dataset of sentences that simulate real-world text data.
You can replace this with any dataset of your choice.
In [2]: # Define a sample text dataset
text = """
Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence.
It focuses on enabling computers to understand, interpret, and respond to human language.
With the rise of large language models, the scope of NLP has expanded significantly.
"""
Tokenization
Tokenization splits raw text into smaller units, such as sentences or individual words (tokens).
In [ ]: print("Tokenization with nltk:")
sentences_nltk = sent_tokenize(text)
print("Sentences:", sentences_nltk)
words_nltk = word_tokenize(text)
print("\nWords:", words_nltk)
# Tokenization with spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("\nTokenization with spaCy:")
print("Tokens:", [token.text for token in doc])
Lowercasing
Lowercasing converts all text to lowercase, which helps in standardising text.
In [ ]: # Lowercasing with nltk
lowercased_words_nltk = [word.lower() for word in words_nltk]
print("Lowercased Words (nltk):", lowercased_words_nltk)
# Lowercasing with spaCy
lowercased_words_spacy = [token.text.lower() for token in doc]
print("Lowercased Words (spaCy):", lowercased_words_spacy)
Stopword Removal
Stopwords are common words (like "the", "is", "and") that add little meaning to text
and can be removed.
In [ ]: # Stopword removal with nltk
stop_words = set(stopwords.words("english"))
filtered_words_nltk = [word for word in words_nltk if word.lower() not in stop_words]
print("Filtered Words (nltk):", filtered_words_nltk)
# Stopword removal with spaCy
filtered_words_spacy = [token.text for token in doc if not token.is_stop]
print("Filtered Words (spaCy):", filtered_words_spacy)
Lemmatization
Lemmatization reduces words to their base or root form (e.g., "running" becomes
"run").
In [ ]: # Lemmatization with nltk
lemmatizer = WordNetLemmatizer()
lemmatized_words_nltk = [lemmatizer.lemmatize(word) for word in filtered_words_nltk]
print("Lemmatized Words (nltk):", lemmatized_words_nltk)
# Lemmatization with spaCy
lemmatized_words_spacy = [token.lemma_ for token in doc if not token.is_stop]
print("Lemmatized Words (spaCy):", lemmatized_words_spacy)
Stemming
Stemming reduces words to their root form by chopping off suffixes using heuristic rules. Unlike lemmatization, the result is not always a valid dictionary word (e.g., "studies" becomes "studi").
In [ ]: stemmer = PorterStemmer()
stemmed_words_nltk = [stemmer.stem(word) for word in filtered_words_nltk]
print("Stemmed Words (nltk):", stemmed_words_nltk)
Conclusion
In this lab, we explored various text preprocessing steps using nltk and spaCy. These
steps are foundational for any NLP task and play a vital role in improving the
performance of machine learning models in NLP. Feel free to experiment with
different datasets and observe the results!
Key Takeaways
nltk and spaCy provide powerful tools for text preprocessing.
Both libraries have unique strengths, with nltk offering traditional NLP tools and
spaCy excelling in modern NLP pipelines.