NLP Lab Manual-1
Python Libraries:
nltk, re, word2vec
List of Experiments
1. Demonstrate Noise Removal for any textual data and remove regular expression
patterns such as hashtags from textual data.
2. Perform lemmatization and stemming using python library nltk.
3. Demonstrate object standardization such as replace social media slangs from a text.
4. Perform part of speech tagging on any textual data.
5. Implement topic modeling using Latent Dirichlet Allocation (LDA ) in python.
6. Demonstrate Term Frequency – Inverse Document Frequency (TF – IDF) using
python
7. Demonstrate word embeddings using word2vec.
8. Implement Text classification using naïve bayes classifier and text blob library.
9. Apply support vector machine for text classification.
10. Convert text to vectors (using term frequency) and apply cosine similarity to
provide closeness among two texts.
11. Case study 1: Identify the sentiment of tweets
In this problem, you are provided with tweet data to predict the sentiment of
netizens towards electronic products.
12. Case study 2: Detect hate speech in tweets.
The objective of this task is to detect hate speech in tweets. For the sake of
simplicity, we say a tweet contains hate speech if it has a racist or sexist
sentiment associated with it.
So, the task is to classify racist or sexist tweets from other tweets.
NATURAL LANGUAGE PROCESSING
Natural language processing (NLP) refers to the branch of computer science—and more
specifically, the branch of artificial intelligence or AI—concerned with giving computers the
ability to understand text and spoken words in much the same way human beings can.
NLP drives computer programs that translate text from one language to another, respond to
spoken commands, and summarize large volumes of text rapidly—even in real time. There’s a
good chance you’ve interacted with NLP in the form of voice-operated GPS systems, digital
assistants, speech-to-text dictation software, customer service chat bots, and other consumer
conveniences. But NLP also plays a growing role in enterprise solutions that help streamline
business operations, increase employee productivity, and simplify mission-critical business
processes.
Several NLP tasks break down human text and voice data in ways that help the computer
make sense of what it's ingesting. Some of these tasks include the following:
Speech recognition, also called speech-to-text, is the task of reliably converting voice
data into text data. Speech recognition is required for any application that follows voice
commands or answers spoken questions. What makes speech recognition especially
challenging is the way people talk—quickly, slurring words together, with varying
emphasis and intonation, in different accents, and often using incorrect grammar.
Part of speech tagging, also called grammatical tagging, is the process of determining
the part of speech of a particular word or piece of text based on its use and context. Part-of-speech
tagging identifies ‘make’ as a verb in ‘I can make a paper plane,’ and as a noun in
‘what make of car do you own?’
Word sense disambiguation is the selection of the meaning of a word with multiple
meanings through a process of semantic analysis that determines the word that makes the
most sense in the given context. For example, word sense disambiguation helps
distinguish the meaning of the verb 'make' in ‘make the grade’ (achieve) vs. ‘make a bet’
(place).
Named entity recognition, or NER, identifies words or phrases as useful entities. NER
identifies ‘Kentucky’ as a location or ‘Fred’ as a man's name.
Co-reference resolution is the task of identifying if and when two words refer to the
same entity. The most common example is determining the person or object to which a
certain pronoun refers (e.g., ‘she’ = ‘Mary’), but it can also involve identifying a
metaphor or an idiom in the text (e.g., an instance in which 'bear' isn't an animal but a
large hairy person).
The Python programming language provides a wide range of tools and libraries for attacking
specific NLP tasks. Many of these are found in the Natural Language Toolkit, or NLTK, an open
source collection of libraries, programs, and education resources for building NLP programs.
The NLTK includes libraries for many of the NLP tasks listed above, plus libraries for subtasks,
such as sentence parsing, word segmentation, stemming and lemmatization (methods of
trimming words down to their roots), and tokenization (for breaking phrases, sentences,
paragraphs and passages into tokens that help the computer better understand the text). It also
includes libraries for implementing capabilities such as semantic reasoning, the ability to reach
logical conclusions based on facts extracted from text.
Download NLTK data: run python shell (in terminal) and write the following code:
import nltk
nltk.download()
What is Word2Vec?
Word2Vec is a widely used method in natural language processing (NLP) that allows
words to be represented as vectors in a continuous vector space. Developed by researchers at
Google, Word2Vec maps words to high-dimensional vectors in order to capture the semantic
relationships between words. The main principle of Word2Vec is that words with similar
meanings should have similar vector representations.
The basic idea of word embeddings is that words occurring in similar contexts tend to be closer
to each other in vector space. For generating word vectors in Python, the required modules
are nltk and gensim. Run these commands in a terminal to install nltk and gensim:
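For example, using pip (assuming a standard Python environment):
pip install nltk
pip install gensim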
NLTK: For handling human language data, NLTK, or Natural Language Toolkit, is a
potent Python library. It offers user-friendly interfaces to more than 50 lexical resources
and corpora, including WordNet. A collection of text processing libraries for tasks like
categorization, tokenization, stemming, tagging, parsing, and semantic reasoning is also
included with NLTK.
GENSIM: Gensim is an open-source Python library that uses topic modeling and
document similarity modeling to manage and analyze massive amounts of unstructured
text data. It is especially well-known for applying topic and vector space modeling
algorithms, such as Word2Vec and Latent Dirichlet Allocation (LDA), which are widely
used.
Experiment 1: Demonstrate Noise Removal for any textual data and remove regular
expression patterns such as hashtags from textual data.
Description: Noise removal is a crucial preprocessing step in natural language processing
(NLP) tasks. It involves cleaning up textual data by removing irrelevant or unwanted
information that does not contribute to the meaning of the text. This can include special
characters, punctuation, numbers, and other non-alphabetic symbols. Regular expressions are
often used to identify and remove specific patterns of noise, such as hashtags (#) in social
media data. Noise removal helps in improving the quality of textual data for further analysis or
processing, such as sentiment analysis, text classification, or language modeling. It simplifies
the text and reduces the dimensionality of the data, making it easier for machine learning
algorithms to extract meaningful insights.
Code:
import re

def remove_noise(text):
    # Define the regular expression pattern for hashtags
    hashtag_pattern = r'#\w+'
    # Remove hashtag patterns from the text
    return re.sub(hashtag_pattern, '', text)

# Example text containing hashtags
text_with_noise = "Just finished reading #HarryPotter and it was #amazing! #booklovers"
clean_text = remove_noise(text_with_noise)

print("Original Text:")
print(text_with_noise)
print("\nText after Noise Removal:")
print(clean_text)
Output:
Original Text:
Just finished reading #HarryPotter and it was #amazing! #booklovers
Experiment 2: Perform lemmatization and stemming using python library nltk.
Description:
Lemmatization and stemming are both techniques used in natural language processing to
reduce words to their base or root form. While stemming involves chopping off prefixes or
suffixes of words to obtain the root, lemmatization involves using vocabulary and morphological
analysis to return the base or dictionary form of a word, which is known as the lemma. The
NLTK library provides a robust toolkit for natural language processing tasks, including
tokenization, lemmatization, stemming, part-of-speech tagging, and more. These techniques are
often used in preprocessing text data before further analysis or modeling.
Code:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def lemmatize_text(text):
    # Tokenize the text into words and lemmatize each token
    tokens = word_tokenize(text)
    return " ".join(lemmatizer.lemmatize(token) for token in tokens)

def stem_text(text):
    # Tokenize the text into words and stem each token
    tokens = word_tokenize(text)
    return " ".join(stemmer.stem(token) for token in tokens)

# Example text
text = "nowadays students all are need to learn the programming"
print("Lemmatized Text:")
print(lemmatize_text(text))
print("Stemmed Text:")
print(stem_text(text))
Output:
Lemmatized Text:
nowaday student all are need to learn the programming
Stemmed Text:
nowadays students all are need to learn the program
Experiment 3: Demonstrate object standardization such as replace social media slangs from a
text.
Description:
Standardizing text by replacing social media slang with formal language involves
converting informal or colloquial language commonly used in social media platforms into more
formal language. This process is typically done to improve readability, professionalism, or
clarity in various contexts, such as formal documents, academic writing, or professional
communications.
Code:
import re

# Define a dictionary mapping social media slangs to their formal equivalents
slang_to_formal = {
    "lol": "laughing out loud",
    "brb": "be right back",
    "omg": "oh my god",
    "btw": "by the way",
    "ttyl": "talk to you later",
    # Add more mappings as needed
}

def standardize_text(text):
    # Replace each slang term with its formal equivalent (whole words only)
    for slang, formal in slang_to_formal.items():
        text = re.sub(r'\b' + re.escape(slang) + r'\b', formal, text)
    return text

# Example usage
input_text = "omg, lol, brb! btw, ttyl!"
standardized_text = standardize_text(input_text)
print("Original text:", input_text)
print("Standardized text:", standardized_text)
Output:
Original text: omg, lol, brb! btw, ttyl!
Standardized text: oh my god, laughing out loud, be right back! by the way, talk to you later!
Experiment 4: Perform part of speech tagging on any textual data.
Description:
Part-of-speech (POS) tagging is a process in natural language processing where words in
a text are assigned to their respective parts of speech, such as nouns, verbs, adjectives, adverbs,
etc. This tagging helps in understanding the grammatical structure of sentences and is an
essential step in many NLP tasks, such as text analysis, information extraction, and sentiment
analysis. Python's NLTK library provides easy-to-use tools for POS tagging.
Code:
import nltk
from nltk.tokenize import word_tokenize
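# NOTE: The remaining steps are not reproduced in the manual; the lines below are a
# minimal completion, consistent with the output shown, using nltk.pos_tag
# (requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')).
# Example text
text = "I love to eat pizza with extra cheese."
tokens = word_tokenize(text)
# Tag each token with its part of speech
tagged_tokens = nltk.pos_tag(tokens)

print("Original Text:", text)
print("Part-of-Speech Tagging:")
for word, tag in tagged_tokens:
    print(f"{word}: {tag}")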
Output:
Original Text: I love to eat pizza with extra cheese.
Part-of-Speech Tagging:
I: PRP
love: VBP
to: TO
eat: VB
pizza: NN
with: IN
extra: JJ
cheese: NN
Explanation:
Each token is labelled with its Penn Treebank part-of-speech tag: PRP (personal pronoun),
VBP (verb, non-third-person singular present), TO (the word 'to'), VB (verb, base form),
NN (noun, singular or mass), IN (preposition or subordinating conjunction), and JJ (adjective).
Experiment 5: Implement topic modeling using Latent Dirichlet Allocation (LDA ) in python.
Description:
Latent Dirichlet Allocation (LDA) is a popular technique for topic modeling, which is
used to discover latent topics present in a collection of documents.
Code:
import gensim
from gensim import corpora
from pprint import pprint
# Sample documents
documents = [
"Machine learning is a subset of artificial intelligence.",
"Natural language processing is used in many applications.",
"Deep learning models have achieved state-of-the-art results.",
"Topic modeling is an unsupervised learning technique.",
"Python programming language is commonly used in data science.",
"Text data preprocessing is important for NLP tasks.",
]
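# NOTE: The modelling steps are not reproduced in the manual; the lines below are a
# sketch assuming gensim's simple_preprocess tokenizer and a 3-topic LDA model,
# consistent with the output shown.
from gensim.utils import simple_preprocess

# Tokenize and lowercase each document
texts = [simple_preprocess(doc) for doc in documents]

# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train the LDA model
lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10, random_state=42)

print("Topics and their top words:")
pprint(lda_model.print_topics())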
Output:
Topics and their top words:
[(0,
'0.079*"learning" + 0.063*"machine" + 0.046*"intelligence" + 0.046*"artificial" +
0.031*"subset" + 0.031*"natural" + 0.031*"processing" + 0.031*"applications" + 0.031*"used"
+ 0.031*"many"'),
(1,
'0.090*"learning" + 0.070*"deep" + 0.070*"models" + 0.045*"state" + 0.045*"achieved" +
0.045*"results" + 0.045*"art" + 0.045*"of" + 0.045*"have" + 0.045*"modeling"'),
(2,
'0.075*"topic" + 0.061*"unsupervised" + 0.061*"is" + 0.061*"an" + 0.061*"modeling" +
0.061*"learning" + 0.061*"technique" + 0.036*"preprocessing" + 0.036*"important" +
0.036*"data"')]
Each line represents a topic along with its top words and their corresponding probabilities. For
example, the first line shows the top words for topic 0, the second line shows the top words for
topic 1, and so on.
You can interpret these topics as follows:
Topic 0: Machine learning and artificial intelligence
Topic 1: Deep learning models and achievements
Topic 2: Topic modeling and unsupervised learning techniques
Experiment 6: Demonstrate Term Frequency – Inverse Document Frequency (TF – IDF) using
python
Code:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "TF-IDF stands for Term Frequency-Inverse Document Frequency.",
    "It is a technique used in Natural Language Processing and Information Retrieval.",
    "TF-IDF quantifies the importance of a term in a document relative to the entire corpus.",
    "It helps in identifying the significance of words in a document.",
]

# Create the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer to the documents and transform the documents into TF-IDF vectors
tfidf_vectors = tfidf_vectorizer.fit_transform(documents)
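# NOTE: The printing loop is not reproduced in the manual; the sketch below prints
# each vocabulary term's TF-IDF score per document (assumes scikit-learn >= 1.0,
# which provides get_feature_names_out).
feature_names = tfidf_vectorizer.get_feature_names_out()
for i, doc_vector in enumerate(tfidf_vectors.toarray()):
    print(f"Document {i + 1}:")
    for term, score in zip(feature_names, doc_vector):
        print(f"{term}: {score:.4f}")
    print()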
Output:
Document 1:
and: 0.0000
document: 0.3536
frequency: 0.3536
idf: 0.3536
importance: 0.0000
in: 0.0000
information: 0.0000
is: 0.0000
it: 0.0000
language: 0.0000
natural: 0.0000
of: 0.0000
processing: 0.3536
quantifies: 0.3536
relative: 0.3536
stands: 0.3536
technique: 0.3536
term: 0.3536
tf: 0.3536
the: 0.0000
Experiment 7: Demonstrate word embeddings using word2vec.
Description:
Word embeddings are numerical representations of words in a continuous vector space. They
encode semantic relationships between words and are a key component in various NLP tasks,
including language modeling, sentiment analysis, machine translation, and named entity recognition.
In Python, there are several popular methods and libraries for generating word embeddings, with
Word2Vec, GloVe, and FastText being among the most commonly used ones. Here's a brief
overview of word2vec:
Code:
from gensim.models import Word2Vec

# Sample corpus (each sentence is a list of tokens)
corpus = [
    ["I", "love", "coding"],
    ["Machine", "learning", "is", "fun"],
    ["Python", "is", "a", "popular", "programming", "language"],
    ["Word", "embeddings", "capture", "semantic", "meanings"],
]
Output:
Most similar words to 'learning':
language: 0.08412346255779266
fun: 0.0378827781085968
semantic: -0.003678104073524475
Experiment 8: Implement Text classification using naïve bayes classifier and text blob
library.
Description:
Naïve Bayes classifier is a simple probabilistic classifier based on Bayes' theorem with
the assumption of independence between features. Despite its simplicity, it often performs well
in practice, especially for text classification tasks. It's called "naïve" because it assumes that all
features are independent of each other, which is rarely the case in real-world data, especially in
natural language processing tasks. However, despite this simplification, Naïve Bayes classifiers
can perform surprisingly well in many situations.
TextBlob is a Python library built on top of NLTK (Natural Language Toolkit) and
Pattern libraries, providing an easy-to-use interface for common NLP tasks. It includes
functionalities for text processing, such as tokenization, part-of-speech tagging, noun phrase
extraction, and sentiment analysis. TextBlob also provides a simple API for text classification
using various classifiers, including Naïve Bayes.
Code:
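# NOTE: The data preparation and training steps are not reproduced in the manual;
# the lines below are a sketch assuming TextBlob's NaiveBayesClassifier and
# scikit-learn metrics on a tiny, purely illustrative dataset.
from textblob.classifiers import NaiveBayesClassifier
from sklearn.metrics import accuracy_score, classification_report

# Tiny labelled dataset (hypothetical examples)
train_data = [
    ("I love this product", "positive"),
    ("This phone is amazing", "positive"),
    ("I hate this product", "negative"),
    ("This phone is terrible", "negative"),
]
test_data = [
    ("What an amazing product", "positive"),
    ("What a terrible phone", "negative"),
]

# Train the Naive Bayes classifier with TextBlob
classifier = NaiveBayesClassifier(train_data)

# Predict labels for the test sentences
y_test = [label for _, label in test_data]
y_pred = [classifier.classify(text) for text, _ in test_data]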
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Print the detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
Output:
Accuracy: 1.00
Classification Report:
precision recall f1-score support
negative 1.00 1.00 1.00 1
positive 1.00 1.00 1.00 1
accuracy 1.00 2
macro avg 1.00 1.00 1.00 2
weighted avg 1.00 1.00 1.00 2