NLP Lab Manual
USN:
Branch: Section:
Prepared By:
Dr S.Anu Pallavi
Assistant Professor Grade-1, Department of Artificial Intelligence and Machine Learning,
AIT, Acharya, Bangalore -107
Reviewed By:
Dr. Nagendra J.
Associate Professor, Department of Artificial Intelligence and Machine Learning,
AIT, Acharya, Bangalore -107
Approved By:
Dr. Vijayshekhar S. S.
Head of the Department of Artificial Intelligence and Machine Learning,
AIT, Acharya, Bangalore -107
VISION
“Acharya Institute of Technology, committed to the cause of sustainable value-based education
in all disciplines, envisions itself as a global fountainhead of innovative human enterprise, with
inspirational initiatives for Academic Excellence”.
MISSION
“Acharya Institute of Technology strives to provide excellent academic ambiance to the students
for achieving global standards of technical education, foster intellectual and personal
development, meaningful research and ethical service to sustainable societal needs.”
VISION
To be recognized as the leader in the field of Artificial Intelligence and Machine Learning by
nurturing and producing quality next-generation academicians and researchers with human
values, who are creative, innovative, and versatile in this fast-growing field.
MISSION
The Department of Artificial Intelligence and Machine Learning (AI and ML) @ Acharya Institute
of Technology’s mission is to produce quality students with a sound understanding of the
fundamentals of the theory and practice of Artificial Intelligence and Machine Learning. The
mission is also to enable students to be leaders in the industry and academia nationally and
internationally. Finally, the mission is to meet the strong demands of the nation in the areas of
Artificial Intelligence and Machine Learning.
PO7: Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
Course Title: NATURAL LANGUAGE PROCESSING
Course Code: BAI601
Core/Elective: AEC
Semester: VI (AIML)
Academic Year: 2025-26
2 0 - 2
List of Prerequisites:
DO’S
1. Please leave footwear outside the laboratory at the designated place.
2. Please keep your belongings, such as bags, in the designated place.
3. Turn off the respective systems and arrange the chairs before leaving the laboratory.
4. Maintain silence and discipline inside the laboratory.
5. Wear your ID card and carry your observation book and record while coming to the laboratory.
DON'TS
1. Do not use mobile phones during lab hours.
2. Do not eat food or chew gum in the laboratory.
3. Do not install, uninstall, or alter any software on the computer.
Course Outcomes:
After the completion of the course, students will be able to:
CO5 ● Know how to generate the output based on the above concepts (RBT Level: L4)
v. CO-PO-PSO Mapping:
Course Outcomes - Program Outcomes – Program Specific Outcomes mapping:
(Program Outcomes: PO1–PO12; Program Specific Outcomes: PSO1–PSO3)
COs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO-1 3 - 2 - 3 - - - - - - - - - 3
CO-2 3 - 2 - 3 - - - - - - - - - 3
CO-3 3 - 2 - 3 - - - 2 - - - - - 3
CO-4 3 - 2 - 3 - - - - - - 2 - - 3
CO-5 3 - 2 - 3 - - - - - - - - - 3
B. COURSE PLAN:
No. (Hrs.) Experiment Content
1 (1 hr) Write a Python program for the following preprocessing of text in NLP:
● Tokenization
● Filtration
● Script Validation
● Stop Word Removal
● Stemming
5 (5 hrs) Given the following short movie reviews, each labeled with a genre, either comedy or action:
• fun, couple, love, love comedy
• fast, furious, shoot action
• couple, fly, fast, fun, fun comedy
• furious, shoot, shoot, fun action
• fly, fast, shoot, love action
and a new document D: fast, couple, shoot, fly
Compute the most likely class for D. Assume a Naive Bayes classifier and use
add-1 smoothing for the likelihoods.
6 (6 hrs) Demonstrate the following using an appropriate programming tool which illustrates the use of information retrieval in NLP:
● Study the various corpora – Brown, Inaugural, Reuters, udhr – with various methods like fileids, raw, words, sents, categories
● Create and use your own corpora (plaintext, categorical)
● Study Conditional frequency distributions
● Study of tagged corpora with methods like tagged_sents, tagged_words
● Write a program to find the most frequent noun tags
● Map Words to Properties Using Python Dictionaries
● Study Rule-based tagger, Unigram Tagger
● Find different words from a given plain text without any spaces by comparing the text with a given corpus of words. Also find the score of the words.
7 (7 hrs) Write a Python program to find synonyms and antonyms of the word "active" using WordNet.
NOTES:
NLP stands for Natural Language Processing. It’s a field of artificial intelligence (AI)
focused on the interaction between computers and human (natural) languages. The goal of
NLP is to enable computers to understand, interpret, and generate human language in a way
that is both meaningful and useful.
iii) Chatbots
Q1. Write a Python program for the following preprocessing of text in NLP:
* Tokenization
* Filtration
* Script Validation
* Stop Word Removal
* Stemming
Program 1:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

# 1. Tokenization
def tokenize(text):
    return word_tokenize(text)

# 2. Filtration: drop punctuation-only tokens (implementation assumed from the sample output)
def filter_text(tokens):
    return [t for t in tokens if re.fullmatch(r"\w+", t)]

# 3. Script Validation: keep tokens written in the Latin (English) script (assumed)
def validate_script(tokens):
    return [t for t in tokens if re.fullmatch(r"[A-Za-z]+", t)]

# 4. Stop Word Removal
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [t for t in tokens if t.lower() not in stop_words]

# 5. Stemming
def stem_tokens(tokens):
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]

text = ("Natural Language Processing is a fascinating field of AI. "
        "It involves processing human languages, like English!")
print(f"Original Text:\n{text}\n")

# Step 1: Tokenization
tokens = tokenize(text)
print(f"Tokens:\n{tokens}\n")

# Step 2: Filtration
filtered_tokens = filter_text(tokens)
print(f"Filtered Tokens:\n{filtered_tokens}\n")

# Step 3: Script Validation
validated_tokens = validate_script(filtered_tokens)

# Step 4: Stop Word Removal
tokens_no_stopwords = remove_stopwords(validated_tokens)

# Step 5: Stemming
stemmed_tokens = stem_tokens(tokens_no_stopwords)
print(f"Stemmed Tokens:\n{stemmed_tokens}\n")
Output:
Original Text:
Natural Language Processing is a fascinating field of AI. It involves processing
human languages, like English!
Tokens:
['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', 'involves', 'processing', 'human', 'languages', ',', 'like', 'English', '!']
Filtered Tokens:
['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', 'It', 'involves', 'processing', 'human', 'languages', 'like', 'English']
Stemmed Tokens:
['natur', 'languag', 'process', 'fascin', 'field', 'ai', 'involv', 'process', 'human', 'languag', 'like', 'english']
Q2. Demonstrate N-gram modeling to analyze and establish the probability distribution across sentences, and explore the use of unigrams, bigrams, and trigrams on diverse English sentences to illustrate the impact of varying n-gram order on the calculated probabilities.
Program 2
import nltk
from nltk.util import ngrams
from collections import Counter, defaultdict

nltk.download('punkt', quiet=True)

# Sample sentences (assumed; this particular set reproduces the probabilities printed below)
sentences = ["The quick brown fox jumps over the lazy dog.",
             "The dog barks at the fox.",
             "The fox is quick and the quick dog jumps over the lazy fox.",
             "The dog is lazy."]

# Generate N-grams
def generate_ngrams(tokens, n):
    return list(ngrams(tokens, n))

# Compute N-gram probabilities over all sentences
def ngram_probabilities(sentences, n):
    tokenized = [[w.lower() for w in nltk.word_tokenize(s) if w.isalpha()] for s in sentences]
    ngram_counts = Counter(g for t in tokenized for g in generate_ngrams(t, n))
    n_minus_1_gram_counts = Counter(g for t in tokenized for g in generate_ngrams(t, n - 1)) if n > 1 else Counter()
    probabilities = defaultdict(float)
    for ngram in ngram_counts:
        if n == 1:  # Unigrams
            probabilities[ngram] = ngram_counts[ngram] / sum(ngram_counts.values())
        else:  # Bigrams or higher-order N-grams
            probabilities[ngram] = ngram_counts[ngram] / n_minus_1_gram_counts[ngram[:-1]]
    return probabilities

for n in (1, 2, 3):
    print(f"\n{n}-Gram Probabilities:")
    for gram, p in ngram_probabilities(sentences, n).items():
        print(f"{' '.join(gram)}: {p:.4f}")
Output
1-Gram Probabilities:
the: 0.2500
quick: 0.0938
brown: 0.0312
fox: 0.1250
jumps: 0.0625
over: 0.0625
lazy: 0.0938
dog: 0.1250
barks: 0.0312
at: 0.0312
is: 0.0625
and: 0.0312
2-Gram Probabilities:
the quick: 0.2500
quick brown: 0.3333
brown fox: 1.0000
fox jumps: 0.2500
jumps over: 1.0000
over the: 1.0000
the lazy: 0.2500
lazy dog: 0.3333
the dog: 0.2500
dog barks: 0.2500
barks at: 1.0000
at the: 1.0000
the fox: 0.2500
fox is: 0.2500
is quick: 0.5000
quick and: 0.3333
and the: 1.0000
dog is: 0.2500
is lazy: 0.5000
quick dog: 0.3333
dog jumps: 0.2500
lazy fox: 0.3333
3-Gram Probabilities:
the quick brown: 0.5000
quick brown fox: 1.0000
brown fox jumps: 1.0000
fox jumps over: 1.0000
jumps over the: 1.0000
over the lazy: 1.0000
the lazy dog: 0.5000
the dog barks: 0.5000
dog barks at: 1.0000
barks at the: 1.0000
at the fox: 1.0000
the fox is: 0.5000
fox is quick: 1.0000
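Each conditional probability above is an N-gram count divided by the count of its (N-1)-gram prefix. For example, P(fox | brown) = C(brown fox) / C(brown) = 1/1 = 1.0000, while P(quick | the) = C(the quick) / C(the) = 2/8 = 0.2500 (the unigram figures imply a 32-token corpus with C(the) = 8).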
Q3. Investigate the Minimum Edit Distance (MED) algorithm and its application in string comparison. The goal is to understand how the algorithm efficiently computes the minimum number of edit operations required to transform one string into another.
● Test the algorithm on strings with different types of variations (e.g., typos, substitutions, insertions, deletions).
● Evaluate its adaptability to different types of input variations.
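In the program below, dp[i][j] holds the edit distance between the first i characters of str1 and the first j characters of str2, filled with the standard recurrence:

dp[i][0] = i,  dp[0][j] = j
dp[i][j] = min(dp[i-1][j] + 1,        (deletion)
               dp[i][j-1] + 1,        (insertion)
               dp[i-1][j-1] + cost)   (substitution; cost = 0 if str1[i-1] == str2[j-1], else 1)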
Program
def min_edit_distance(str1, str2):
    m, n = len(str1), len(str2)
    # dp[i][j] = edit distance between str1[:i] and str2[:j]
    dp = [[i + j if i * j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if str1[i - 1] == str2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[m][n], dp

def display_table(str1, str2, dp):
    """
    Display the edit distance table in a readable format.
    """
    print("\nEdit Distance Table:")
    print(f" {' '.join('#' + c for c in ' ' + str2)}")
    for i, row in enumerate(dp):
        prefix = '#' if i == 0 else str1[i - 1]
        print(f"{prefix} {' '.join(map(str, row))}")

def test_med():
    # Test pair inferred from the output below; add more pairs to explore other variations
    str1, str2 = "flaw", "lawn"
    distance, dp = min_edit_distance(str1, str2)
    print(f"Minimum edit distance between '{str1}' and '{str2}': {distance}")
    display_table(str1, str2, dp)

# Run tests
test_med()
Output
# #l #a #w #n
# 0 1 2 3 4
f 1 1 2 3 4
l 2 1 2 3 4
a 3 2 1 2 3
w 4 3 2 1 2
Q4. Write a program to implement top-down and bottom-up parsers using an appropriate context-free grammar.
Program:
# A runnable sketch using NLTK's RecursiveDescentParser (top-down) and
# ShiftReduceParser (bottom-up); the grammar below is an assumption chosen
# to cover the sample sentence.
import nltk
from nltk import CFG
from nltk.parse import RecursiveDescentParser, ShiftReduceParser

grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'cat' | 'dog'
V -> 'chases'
""")

# Sample sentence
sentence = ['the', 'cat', 'chases', 'a', 'dog']

# Top-down parsing: expand from S and match the expansion against the input
print("Top-down parse:")
for tree in RecursiveDescentParser(grammar).parse(sentence):
    print(tree)

# Bottom-up parsing: shift words onto a stack and reduce phrases until S is built
print("\nBottom-up parse:")
for tree in ShiftReduceParser(grammar).parse(sentence):
    print(tree)

# Expected parse (for both): (S (NP (Det the) (N cat)) (VP (V chases) (NP (Det a) (N dog))))
Output:
Q5. Given the following short movie reviews, each labeled with a genre, either comedy or action:
● fun, couple, love, love comedy
● fast, furious, shoot action
● couple, fly, fast, fun, fun comedy
● furious, shoot, shoot, fun action
● fly, fast, shoot, love action
and a new document D: fast, couple, shoot, fly. Compute the most likely class for D, assuming a Naive Bayes classifier with add-1 smoothing for the likelihoods.
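Worked by hand for the action class (vocabulary size V = 7; the action documents contain 11 tokens in total; prior P(action) = 3/5):

P(action | D) ∝ P(action) · P(fast|action) · P(couple|action) · P(shoot|action) · P(fly|action)
             = (3/5) · (2+1)/18 · (0+1)/18 · (4+1)/18 · (1+1)/18
             = 0.6 · 30 / 18^4 ≈ 1.7147e-04

which matches the posterior for action printed in the output below.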
Program
# Step 4: Apply Add-1 smoothing to calculate the likelihoods for each word
def calc_likelihoods(class_word_counts, class_label):
    total_words = sum(class_word_counts[class_label].values())
    likelihoods = {}
    for word in vocab:
        # Add-1 smoothing
        likelihoods[word] = (class_word_counts[class_label].get(word, 0) + 1) / (total_words + V)
    return likelihoods
Output:
Posterior for comedy: 9.481481481481481e-05
Posterior for action: 0.00017146776406035664
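Only the likelihood step is shown above; a self-contained sketch of the complete classifier (the data setup and variable names are assumptions consistent with the problem statement):

from collections import Counter

# Training documents (words, class) from the problem statement
docs = [(["fun", "couple", "love", "love"], "comedy"),
        (["fast", "furious", "shoot"], "action"),
        (["couple", "fly", "fast", "fun", "fun"], "comedy"),
        (["furious", "shoot", "shoot", "fun"], "action"),
        (["fly", "fast", "shoot", "love"], "action")]

vocab = {w for words, _ in docs for w in words}
V = len(vocab)  # vocabulary size (7)

# Per-class word counts and class priors
class_word_counts = {c: Counter() for c in ("comedy", "action")}
for words, c in docs:
    class_word_counts[c].update(words)
priors = {c: sum(1 for _, lbl in docs if lbl == c) / len(docs) for c in class_word_counts}

# Add-1 smoothed likelihoods (Step 4 above)
def calc_likelihoods(class_word_counts, class_label):
    total_words = sum(class_word_counts[class_label].values())
    return {w: (class_word_counts[class_label][w] + 1) / (total_words + V) for w in vocab}

# Posterior (up to a constant) for the new document D
D = ["fast", "couple", "shoot", "fly"]
for c in ("comedy", "action"):
    likelihoods = calc_likelihoods(class_word_counts, c)
    posterior = priors[c]
    for w in D:
        posterior *= likelihoods[w]
    print(f"Posterior for {c}: {posterior}")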
Q6. Demonstrate the following using an appropriate programming tool which illustrates the use of information retrieval in NLP:
Program
Open a Python shell or a Jupyter notebook and run this code to download the missing corpus:
import nltk
nltk.download('brown')
nltk.download('inaugural')
nltk.download('reuters')
nltk.download('udhr')
nltk.download('stopwords')
import nltk
from nltk.corpus import brown, inaugural, reuters, udhr
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk import pos_tag, word_tokenize
from nltk.tag import UnigramTagger, DefaultTagger
from nltk.corpus import stopwords
import string
# 8. Text Comparison (find words from a given plain text without spaces)
# Define the corpus to compare against
word_corpus = set(brown.words())

# Function to break the input text into valid words and score them
def find_words_from_text(text, corpus):
    words = []
    score = 0
    i = 0
    while i < len(text):
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in corpus:
                words.append(text[i:j])
                score += 1
                i = j
                break
        else:
            i += 1
    return words, score
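The listing above covers only the text-comparison task; a brief sketch of the remaining experiment bullets (the corpus and tagger method names are NLTK's; the particular examples are illustrative):

import nltk
from nltk.corpus import brown, inaugural, reuters
from nltk.probability import ConditionalFreqDist
from nltk.tag import UnigramTagger, DefaultTagger

# Corpus methods: fileids, raw, words, sents, categories
print(brown.categories()[:5])
print(inaugural.fileids()[:3])
print(reuters.words()[:10])
print(brown.sents(categories='news')[0])

# Conditional frequency distribution: word lengths per category
cfd = ConditionalFreqDist((cat, len(w))
                          for cat in brown.categories()[:2]
                          for w in brown.words(categories=cat))
cfd.tabulate(samples=range(1, 8))

# Tagged corpora (tagged_sents, tagged_words) and the most frequent nouns
tagged = brown.tagged_words(categories='news')
nouns = nltk.FreqDist(w.lower() for w, tag in tagged if tag.startswith('NN'))
print(nouns.most_common(5))

# Mapping words to properties with a Python dictionary
pos = {'cat': 'NN', 'chases': 'VBZ'}
print(pos['cat'])

# Rule-based default tagger as backoff for a Unigram tagger
train_sents = brown.tagged_sents(categories='news')[:500]
tagger = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
print(tagger.tag(['the', 'cat', 'chases', 'a', 'dog']))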
Output:
Inaugural Corpus Raw Text: Fellow-Citizens of the Senate and of the House of Representatives:
Q7. Write a Python program to find synonyms and antonyms of the word "active" using WordNet.
Program
import nltk
from nltk.corpus import wordnet
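The listing shows only the imports; a minimal sketch of the lookup itself, iterating over WordNet synsets and lemmas (sets are used, so the printed order may vary):

nltk.download('wordnet', quiet=True)

synonyms, antonyms = set(), set()
for syn in wordnet.synsets("active"):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name())
        for ant in lemma.antonyms():
            antonyms.add(ant.name())

print(f"Synonyms of 'active': {list(synonyms)}")
print(f"Antonyms of 'active': {list(antonyms)}")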
Output:
Synonyms of 'active': ['alive', 'participating', 'combat-ready', 'dynamic', 'active', 'active_agent', 'fighting', 'active_voice']
Antonyms of 'active': ['inactive', 'extinct', 'passive', 'passive_voice', 'dormant', 'quiet', 'stative']
Q8. Implement the machine translation application of NLP, where a machine translation model must be trained for a language with limited parallel corpora. Investigate and incorporate techniques to improve performance in low-resource scenarios.
Program
from transformers import MarianMTModel, MarianTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Pretrained MarianMT checkpoint (assumed language pair; substitute your own).
# Fine-tuning a pretrained model is itself a key low-resource technique (transfer learning).
model_name = "Helsinki-NLP/opus-mt-en-hi"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# A small slice of a parallel corpus to simulate the low-resource setting (assumed dataset)
raw = load_dataset("cfilt/iitb-english-hindi", split="train[:1100]")

def preprocess(batch):
    src = [ex["en"] for ex in batch["translation"]]
    tgt = [ex["hi"] for ex in batch["translation"]]
    inputs = tokenizer(src, truncation=True, padding="max_length", max_length=128)
    labels = tokenizer(text_target=tgt, truncation=True, padding="max_length", max_length=128)
    inputs["labels"] = labels["input_ids"]  # for simplicity; ideally replace pad ids with -100
    return inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)
split = tokenized.train_test_split(test_size=100)
train_dataset, eval_dataset = split["train"], split["test"]

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,  # required since evaluation_strategy="epoch"
)

trainer.train()
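A widely used technique for the low-resource requirement is back-translation: translate monolingual target-language text into the source language with a reverse-direction model, then append the resulting synthetic pairs to the parallel training data. A minimal sketch (the reverse checkpoint name is an assumption):

from transformers import MarianMTModel, MarianTokenizer

# Reverse-direction model (target -> source), assumed checkpoint
rev_name = "Helsinki-NLP/opus-mt-hi-en"
rev_tokenizer = MarianTokenizer.from_pretrained(rev_name)
rev_model = MarianMTModel.from_pretrained(rev_name)

def back_translate(target_sentences):
    # Generate synthetic source-language sentences for monolingual target text
    batch = rev_tokenizer(target_sentences, return_tensors="pt", padding=True, truncation=True)
    generated = rev_model.generate(**batch)
    return rev_tokenizer.batch_decode(generated, skip_special_tokens=True)

# The synthetic (source, target) pairs can then be added to train_dataset
monolingual_hi = ["यह एक उदाहरण वाक्य है।"]  # "This is an example sentence."
synthetic_src = back_translate(monolingual_hi)
print(list(zip(synthetic_src, monolingual_hi)))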