Text Preprocessing 13/01/25, 21:05
Text Preprocessing with spaCy and nltk
This lab program demonstrates how to preprocess text using two powerful libraries:
spaCy and nltk. Text preprocessing is a crucial step in Natural Language Processing
(NLP) pipelines. It involves cleaning and preparing text data for analysis or model
training.
Steps Covered
1. Tokenization
2. Lowercasing
3. Stopword Removal
4. Lemmatization
5. Stemming (nltk)
Let's get started!
Install and Import Libraries
In [ ]: !pip install spacy nltk -q
# Download spaCy model and nltk resources
!python -m spacy download en_core_web_sm -q
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
# Import libraries
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
Text Dataset
For this lab, we'll use a small dataset of sentences that simulate real-world text data.
You can replace this with any dataset of your choice.
In [2]: # Define a sample text dataset
text = """
Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence.
It focuses on enabling computers to understand, interpret, and respond to human language.
With the rise of large language models, the scope of NLP has expanded significantly.
"""
Tokenization
Tokenization splits raw text into smaller units, such as sentences or individual words (tokens).
In [ ]: print("Tokenization with nltk:")
sentences_nltk = sent_tokenize(text)
print("Sentences:", sentences_nltk)
words_nltk = word_tokenize(text)
print("\nWords:", words_nltk)
# Tokenization with spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("\nTokenization with spaCy:")
print("Tokens:", [token.text for token in doc])
Lowercasing
Lowercasing converts all text to lowercase, which helps in standardising text.
In [ ]: # Lowercasing with nltk
lowercased_words_nltk = [word.lower() for word in words_nltk]
print("Lowercased Words (nltk):", lowercased_words_nltk)
# Lowercasing with spaCy
lowercased_words_spacy = [token.text.lower() for token in doc]
print("Lowercased Words (spaCy):", lowercased_words_spacy)
Stopword Removal
Stopwords are common words (like "the", "is", "and") that add little meaning to text
and can be removed.
In [ ]: # Stopword removal with nltk
stop_words = set(stopwords.words("english"))
filtered_words_nltk = [word for word in words_nltk if word.lower() not in stop_words]
print("Filtered Words (nltk):", filtered_words_nltk)
# Stopword removal with spaCy
filtered_words_spacy = [token.text for token in doc if not token.is_stop]
print("Filtered Words (spaCy):", filtered_words_spacy)
Lemmatization
Lemmatization reduces words to their base or root form (e.g., "running" becomes
"run").
In [ ]: # Lemmatization with nltk
lemmatizer = WordNetLemmatizer()
lemmatized_words_nltk = [lemmatizer.lemmatize(word) for word in filtered_words_nltk]
print("Lemmatized Words (nltk):", lemmatized_words_nltk)
# Lemmatization with spaCy
lemmatized_words_spacy = [token.lemma_ for token in doc if not token.is_stop]
print("Lemmatized Words (spaCy):", lemmatized_words_spacy)
Stemming
Stemming reduces words to their root form by chopping off suffixes using heuristic rules. Unlike lemmatization, the result is not always a valid dictionary word (e.g., "studies" becomes "studi").
In [ ]: stemmer = PorterStemmer()
stemmed_words_nltk = [stemmer.stem(word) for word in filtered_words_nltk]
print("Stemmed Words (nltk):", stemmed_words_nltk)
Conclusion
In this lab, we explored various text preprocessing steps using nltk and spaCy. These
steps are foundational for any NLP task and play a vital role in improving the
performance of machine learning models in NLP. Feel free to experiment with
different datasets and observe the results!
Key Takeaways
nltk and spaCy provide powerful tools for text preprocessing.
Both libraries have unique strengths, with nltk offering traditional NLP tools and
spaCy excelling in modern NLP pipelines.