
NATURAL LANGUAGE PROCESSING (BAI601) 2025-26

ACHARYA INSTITUTE OF TECHNOLOGY


Acharya Dr. Sarvepalli Radhakrishnan Road, Soladevanahalli,
Bengaluru, Karnataka 560107
(Affiliated to VTU, Jnana Sangama, Belagavi)

Department of Artificial Intelligence and Machine Learning

NATURAL LANGUAGE PROCESSING LAB MANUAL


B.E - VI Semester

Subject Code - BAI601

Lab Manual – 2025-26


Name:

USN:

Branch: Section:

Prepared By:
Dr S.Anu Pallavi
Assistant Professor Grade-1, Department of Artificial Intelligence and Machine Learning,
AIT, Acharya, Bangalore -107

Reviewed By:
Dr. Nagendra J.
Associate Professor, Department of Artificial Intelligence and Machine Learning,
AIT, Acharya, Bangalore -107

Approved By:
Dr. Vijayshekhar S. S.
Head of the Department of Artificial Intelligence and Machine Learning,
AIT, Acharya, Bangalore -107


ACHARYA INSTITUTE OF TECHNOLOGY MOTTO

VISION
“Acharya Institute of Technology, committed to the cause of sustainable value-based education
in all disciplines, envisions itself as a global fountainhead of innovative human enterprise, with
inspirational initiatives for Academic Excellence”.

MISSION
“Acharya Institute of Technology strives to provide excellent academic ambiance to the students
for achieving global standards of technical education, foster intellectual and personal
development, meaningful research and ethical service to sustainable societal needs.”

Department of Artificial Intelligence and Machine Learning

VISION
To be recognized as the leader in the field of Artificial Intelligence and Machine Learning by
nurturing and producing quality next-generation academicians and researchers with human
values, who are creative, innovative, and versatile in this fast-growing field.

MISSION
The Department of Artificial Intelligence and Machine Learning (AI and ML) @ Acharya Institute
of Technology’s mission is to produce quality students with a sound understanding of the
fundamentals of the theory and practice of Artificial Intelligence and Machine Learning. The
mission is also to enable students to be leaders in the industry and academia nationally and
internationally. Finally, the mission is to meet the strong demands of the nation in the areas of
Artificial Intelligence and Machine Learning.


PROGRAM EDUCATIONAL OBJECTIVES (PEOs)


PEO1: Students shall have successful professional careers in industry, academia, or R&D organizations, or as entrepreneurs, in the specialized fields of Artificial Intelligence and Machine Learning and allied disciplines.
PEO2: Students shall be competent, creative and valued professionals in the chosen field.
PEO3: Engage in life-long learning and professional development.
PEO4: Become effective global collaborators, leading or participating to address technical,
business, environmental and societal challenges.

PROGRAM OUTCOMES (POs)

PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2: Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
PO3: Design/development of solutions: Design solutions for complex engineering problems
and design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
PO4: Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex engineering
activities with an understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.


PO7: Environment and sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.


PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.
PO11: Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.


Table of Contents

Course Details

Do’s and Don’ts in the lab

Experiments List

Experiment Instructions


A. COURSE DETAILS & COURSE COORDINATOR DETAILS:


i) Course Details:

Course Title: NATURAL LANGUAGE PROCESSING
Course Code: BAI601
Core/Elective: AEC
Semester: VI (AIML)
Academic Year: 2025-26

Contact Hours/week: Lecture - 2, Tutorials - 0, Practical - 2

ii) Course Administrator/Coordinator Details:

DETAILS OF THE FACULTY CO-ORDINATORS/COURSE ADMINISTRATORS

S.N: 1
Name of the Faculty: Dr. Anu Pallavi
Designation: Assistant Professor Grade-I
Department: AI/ML
E-mail ID: anupallavi2893@acharya.ac.in
Contact Number: 9489380617

iii) Course Related Specific details:

List of Prerequisites:

1. Basic knowledge of Deep Learning and Neural Networks

iv) Do’s and Don’ts in the lab

DO’S
1. Please leave footwear outside the laboratory at the designated place.
2. Please keep your belongings, such as bags, in the designated place.
3. Turn off the respective systems and arrange the chairs before leaving the laboratory.
4. Maintain silence and discipline inside the laboratory.
5. Wear your ID card, and carry your observation book and record while coming to the laboratory.

DON'TS
1. Don’t use mobile phones during lab hours.
2. Do not eat food or chew gum in the laboratory.
3. Do not install, uninstall, or alter any software on the computers.
4. Students are not allowed to work in the laboratory alone or without the presence of faculty or an instructor.
5. Do not move any equipment from its original position.
6. The department is not responsible for any belongings left behind.
7. Do not enter the laboratory without permission.

Course Outcomes:
After the completion of the course, students will be able to:

CO1: Understand what NLP is. (RBT Level: L1)
CO2: Parse tokens in NLP. (RBT Level: L4)
CO3: Perform semantic analysis in NLP. (RBT Level: L3)
CO4: Perform sentiment analysis in NLP. (RBT Level: L1)
CO5: Generate output based on the above concepts. (RBT Level: L4)

v. CO-PO-PSO Mapping:
Course Outcomes - Program Outcomes - Program Specific Outcomes mapping:

COs    PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
CO-1    3    -    2    -    3    -    -    -    -    -     -     -     -     -     3
CO-2    3    -    2    -    3    -    -    -    -    -     -     -     -     -     3
CO-3    3    -    2    -    3    -    -    -    2    -     -     -     -     -     3
CO-4    3    -    2    -    3    -    -    -    -    -     -     2     -     -     3
CO-5    3    -    2    -    3    -    -    -    -    -     -     -     -     -     3

vi. Course Outcomes/Program Outcomes assessment methods:


Course Assessment Procedure:


Procedure for Internal Assessment: 2 IA Tests + Records + Attendance

Maximum Marks for Internal Assessment: 50 Marks

Maximum Marks for Final Exam: 50 Marks

Assessment Tools                                         Weightage   Frequency             Responsibility

Direct Assessment:
  Continuous Internal Evaluation (CIE) -
  Lab Internal Assessment                                50%         Twice in a semester   Department level
  Semester End Exam (SEE)                                50%         Once in a semester    Department level

Course Outcome Attainment Computation

Marks scored by students   1% to 55%   56% to 80%   81% to 100%
Weightage                  1           2            3

Target attainment = 75%

B. COURSE PLAN:

Course plan/Lesson Plan

Experiment 1 (Hour 1): Write a Python program for the following preprocessing of text in NLP:
● Tokenization
● Filtration
● Script Validation
● Stop Word Removal
● Stemming

Experiment 2 (Hour 2): Demonstrate N-gram modeling to analyze and establish the probability distribution across sentences, and explore the use of unigrams, bigrams, and trigrams in diverse English sentences to illustrate the impact of varying n-gram orders on the calculated probabilities.

Experiment 3 (Hour 3): Investigate the Minimum Edit Distance (MED) algorithm and its application in string comparison; the goal is to understand how the algorithm efficiently computes the minimum number of edit operations required to transform one string into another.

Experiment 4 (Hour 4): Write a program to implement top-down and bottom-up parsers using an appropriate context-free grammar.

Experiment 5 (Hour 5): Given the following short movie reviews, each labeled with a genre, either comedy or action:
● fun, couple, love, love (comedy)
● fast, furious, shoot (action)
● couple, fly, fast, fun, fun (comedy)
● furious, shoot, shoot, fun (action)
● fly, fast, shoot, love (action)
and a new document D: fast, couple, shoot, fly
Compute the most likely class for D. Assume a Naive Bayes classifier and use add-1 smoothing for the likelihoods.

Experiment 6 (Hour 6): Demonstrate the following using an appropriate programming tool to illustrate the use of information retrieval in NLP:
● Study the various corpora (Brown, Inaugural, Reuters, udhr) with methods like fileids, raw, words, sents, categories
● Create and use your own corpora (plaintext, categorical)
● Study conditional frequency distributions
● Study tagged corpora with methods like tagged_sents, tagged_words
● Write a program to find the most frequent noun tags
● Map words to properties using Python dictionaries
● Study the rule-based tagger and the unigram tagger
● Find the words in a given plain text written without spaces by comparing it against a given corpus of words, and report a score.

Experiment 7 (Hour 7): Write a Python program to find synonyms and antonyms of the word "active" using WordNet.

Experiment 8 (Hour 8): Implement the machine translation application of NLP, where a machine translation model must be trained for a language with limited parallel corpora. Investigate and incorporate techniques to improve performance in low-resource scenarios.


NOTES:

Natural Language Processing

What is Natural Language Processing?

NLP stands for Natural Language Processing. It is a field of artificial intelligence (AI) focused on the interaction between computers and human (natural) languages. The goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.

What are the applications of Natural Language Processing?

i) Speech recognition (e.g., voice assistants such as Siri or Alexa)

ii) Text-to-speech conversion

iii) Chatbots

iv) Text summarization and information extraction

Q1. Write a Python program for the following preprocessing of text in NLP:
* Tokenization
* Filtration
* Script Validation
* Stop Word Removal
* Stemming

Program 1:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download necessary NLTK data files
nltk.download('punkt')
nltk.download('stopwords')

# Sample text for preprocessing
text = """Natural Language Processing is a fascinating field of AI. It involves processing human languages, like English!"""

# 1. Tokenization
def tokenize(text):
    return word_tokenize(text)

# 2. Filtration (remove special characters, numbers, etc.)
def filter_text(tokens):
    filtered_tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens]  # Remove special characters
    filtered_tokens = [token for token in filtered_tokens if token.isalpha()]  # Keep only alphabetic words
    return filtered_tokens

# 3. Script Validation (e.g., check if text is in English using a regex)
def validate_script(tokens):
    english_tokens = [token for token in tokens if re.match(r'^[a-zA-Z]+$', token)]
    return english_tokens

# 4. Stop Word Removal
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [token for token in tokens if token.lower() not in stop_words]

# 5. Stemming
def stem_tokens(tokens):
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]

# Pipeline for text preprocessing
def preprocess_text(text):
    print(f"Original Text:\n{text}\n")

    # Step 1: Tokenization
    tokens = tokenize(text)
    print(f"Tokens:\n{tokens}\n")

    # Step 2: Filtration
    filtered_tokens = filter_text(tokens)
    print(f"Filtered Tokens:\n{filtered_tokens}\n")

    # Step 3: Script Validation
    validated_tokens = validate_script(filtered_tokens)
    print(f"Validated Tokens (English Only):\n{validated_tokens}\n")

    # Step 4: Stop Word Removal
    tokens_no_stopwords = remove_stopwords(validated_tokens)
    print(f"Tokens after Stop Word Removal:\n{tokens_no_stopwords}\n")

    # Step 5: Stemming
    stemmed_tokens = stem_tokens(tokens_no_stopwords)
    print(f"Stemmed Tokens:\n{stemmed_tokens}\n")

    return stemmed_tokens

# Run the preprocessing pipeline
preprocessed_tokens = preprocess_text(text)

Output:
Original Text:
Natural Language Processing is a fascinating field of AI. It involves processing human languages, like English!

Tokens:
['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', 'involves', 'processing', 'human', 'languages', ',', 'like', 'English', '!']

Filtered Tokens:
['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', 'It', 'involves', 'processing', 'human', 'languages', 'like', 'English']

Validated Tokens (English Only):
['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', 'It', 'involves', 'processing', 'human', 'languages', 'like', 'English']

Tokens after Stop Word Removal:
['Natural', 'Language', 'Processing', 'fascinating', 'field', 'AI', 'involves', 'processing', 'human', 'languages', 'like', 'English']

Stemmed Tokens:
['natur', 'languag', 'process', 'fascin', 'field', 'ai', 'involv', 'process', 'human', 'languag', 'like', 'english']
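
Note: on newer NLTK releases (3.9 and later), word_tokenize loads the 'punkt_tab' resource instead of 'punkt'. If the program raises a LookupError, the extra one-off download below (an assumption about the installed NLTK version) should resolve it:

import nltk
nltk.download('punkt_tab')  # tokenizer tables used by word_tokenize on NLTK 3.9+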

Q2. Demonstrate the N-gram modeling to analyze and establish the probability distribution across
sentences and explore the utilization of unigrams, bigrams, and trigrams in diverse English
sentences to illustrate the impact of varying n-gram orders on the calculated probabilities.

Program 2

import nltk
from nltk.util import ngrams
from collections import Counter, defaultdict

# Download necessary NLTK data files
nltk.download('punkt')

# Sample sentences for N-gram modeling
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "The dog barks at the fox",
    "The fox is quick and the dog is lazy",
    "The quick dog jumps over the lazy fox"
]

# Preprocess and tokenize sentences
def preprocess(sentences):
    tokenized_sentences = [nltk.word_tokenize(sentence.lower()) for sentence in sentences]
    return tokenized_sentences

# Generate N-grams
def generate_ngrams(tokens, n):
    return list(ngrams(tokens, n))

# Calculate probabilities for N-grams
def calculate_ngram_probabilities(tokenized_sentences, n):
    ngram_counts = Counter()
    n_minus_1_gram_counts = Counter()

    for tokens in tokenized_sentences:
        ngram_counts.update(generate_ngrams(tokens, n))
        if n > 1:
            n_minus_1_gram_counts.update(generate_ngrams(tokens, n - 1))

    probabilities = defaultdict(float)
    for ngram in ngram_counts:
        if n == 1:  # Unigrams: relative frequency
            probabilities[ngram] = ngram_counts[ngram] / sum(ngram_counts.values())
        else:  # Bigrams or higher-order N-grams: conditional probability
            probabilities[ngram] = ngram_counts[ngram] / n_minus_1_gram_counts[ngram[:-1]]

    return probabilities

# Display N-gram probabilities
def display_ngram_probabilities(probabilities, n):
    print(f"\n{n}-Gram Probabilities:")
    for ngram, prob in probabilities.items():
        print(f"{' '.join(ngram)}: {prob:.4f}")

# Main function to analyze N-grams
def analyze_ngrams(sentences):
    tokenized_sentences = preprocess(sentences)
    for n in [1, 2, 3]:  # Unigram, Bigram, Trigram
        probabilities = calculate_ngram_probabilities(tokenized_sentences, n)
        display_ngram_probabilities(probabilities, n)

# Run the analysis
analyze_ngrams(sentences)


Output

1-Gram Probabilities:
the: 0.2500
quick: 0.0938
brown: 0.0312
fox: 0.1250
jumps: 0.0625
over: 0.0625
lazy: 0.0938
dog: 0.1250
barks: 0.0312
at: 0.0312
is: 0.0625
and: 0.0312

2-Gram Probabilities:
the quick: 0.2500
quick brown: 0.3333
brown fox: 1.0000
fox jumps: 0.2500
jumps over: 1.0000
over the: 1.0000
the lazy: 0.2500
lazy dog: 0.3333
the dog: 0.2500
dog barks: 0.2500
barks at: 1.0000
at the: 1.0000
the fox: 0.2500
fox is: 0.2500
is quick: 0.5000
quick and: 0.3333
and the: 1.0000
dog is: 0.2500
is lazy: 0.5000
quick dog: 0.3333
dog jumps: 0.2500
lazy fox: 0.3333

3-Gram Probabilities:
the quick brown: 0.5000
quick brown fox: 1.0000
brown fox jumps: 1.0000
fox jumps over: 1.0000
jumps over the: 1.0000
over the lazy: 1.0000
the lazy dog: 0.5000
the dog barks: 0.5000
dog barks at: 1.0000
barks at the: 1.0000
at the fox: 1.0000
the fox is: 0.5000
fox is quick: 1.0000
is quick and: 1.0000


quick and the: 1.0000
and the dog: 1.0000
the dog is: 0.5000
dog is lazy: 1.0000
the quick dog: 0.5000
quick dog jumps: 1.0000
dog jumps over: 1.0000
the lazy fox: 0.5000
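
The tables above can be chained to score a whole phrase: P(w1 ... wn) is approximated by P(w1) multiplied by the product of the bigram probabilities P(wi | wi-1). A minimal sketch reusing the functions from Program 2:

# Approximate P("the quick brown fox") by chaining unigram and bigram estimates
tokenized = preprocess(sentences)
unigrams = calculate_ngram_probabilities(tokenized, 1)
bigrams = calculate_ngram_probabilities(tokenized, 2)

words = ['the', 'quick', 'brown', 'fox']
prob = unigrams[(words[0],)]              # P(the) = 0.2500
for w1, w2 in zip(words, words[1:]):
    prob *= bigrams[(w1, w2)]             # 0.2500 * 0.3333 * 1.0000
print(f"P('the quick brown fox') ~ {prob:.4f}")   # ~ 0.0208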

Q3. Investigate the Minimum Edit Distance (MED) algorithm and its application in string comparison; the goal is to understand how the algorithm efficiently computes the minimum number of edit operations required to transform one string into another.
● Test the algorithm on strings with different types of variations (e.g., typos, substitutions, insertions, deletions)
● Evaluate its adaptability to different types of input variations

Program

def minimum_edit_distance(str1, str2):
    """
    Compute the Minimum Edit Distance (MED) between two strings using dynamic programming.
    """
    m, n = len(str1), len(str2)

    # Initialize the DP table
    dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]

    # Base case initialization
    for i in range(m + 1):
        dp[i][0] = i  # Cost of deletions
    for j in range(n + 1):
        dp[0][j] = j  # Cost of insertions

    # Fill the DP table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:  # No cost if characters match
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j],      # Deletion
                    dp[i][j - 1],      # Insertion
                    dp[i - 1][j - 1]   # Substitution
                )

    return dp[m][n], dp

def display_edit_distance_table(dp, str1, str2):
    """
    Display the edit distance table in a readable format.
    """
    print("\nEdit Distance Table:")
    print(f"  {' '.join('#' + c for c in ' ' + str2)}")
    for i, row in enumerate(dp):
        prefix = '#' if i == 0 else str1[i - 1]
        print(f"{prefix} {' '.join(map(str, row))}")

# Test the MED algorithm with variations
def test_med():
    test_cases = [
        ("kitten", "sitting"),       # Substitution, insertion, deletion
        ("flaw", "lawn"),            # Substitution and transposition
        ("intention", "execution"),  # Complex sequence
        ("apple", "aple"),           # Single deletion
        ("", "hello"),               # All insertions
        ("hello", ""),               # All deletions
    ]

    for str1, str2 in test_cases:
        print(f"\nComparing: '{str1}' -> '{str2}'")
        med, dp = minimum_edit_distance(str1, str2)
        print(f"Minimum Edit Distance: {med}")
        display_edit_distance_table(dp, str1, str2)

# Run tests
test_med()

Output

Comparing: 'kitten' -> 'sitting'


Minimum Edit Distance: 3

Edit Distance Table:


# #s #i #t #t #i #n #g
# 0 1 2 3 4 5 6 7
k 1 1 2 3 4 5 6 7
i 2 2 1 2 3 4 5 6
t 3 3 2 1 2 3 4 5
t 4 4 3 2 1 2 3 4
e 5 5 4 3 2 2 3 4
n 6 6 5 4 3 3 2 3

Comparing: 'flaw' -> 'lawn'


Minimum Edit Distance: 2

Edit Distance Table:


# #l #a #w #n
# 0 1 2 3 4
f 1 1 2 3 4
l 2 1 2 3 4
a 3 2 1 2 3
w 4 3 2 1 2

Comparing: 'intention' -> 'execution'


Minimum Edit Distance: 5

Edit Distance Table:


# #e #x #e #c #u #t #i #o #n
# 0 1 2 3 4 5 6 7 8 9
i 1 1 2 3 4 5 6 6 7 8
n 2 2 2 3 4 5 6 7 7 7
t 3 3 3 3 4 5 5 6 7 8
e 4 3 4 3 4 5 6 6 7 8
n 5 4 4 4 4 5 6 7 7 7
t 6 5 5 5 5 5 5 6 7 8
i 7 6 6 6 6 6 6 5 6 7
o 8 7 7 7 7 7 7 6 5 6
n 9 8 8 8 8 8 8 7 6 5

Comparing: 'apple' -> 'aple'


Minimum Edit Distance: 1

Edit Distance Table:


# #a #p #l #e
# 0 1 2 3 4
a 1 0 1 2 3
p 2 1 0 1 2
p 3 2 1 1 2
l 4 3 2 1 2
e 5 4 3 2 1

Comparing: '' -> 'hello'


Minimum Edit Distance: 5

Edit Distance Table:


# #h #e #l #l #o
# 0 1 2 3 4 5

Comparing: 'hello' -> ''


Minimum Edit Distance: 5

Edit Distance Table:


#
# 0
h 1
e 2
l 3
l 4
o 5
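
As a quick sanity check, the distances above can be cross-checked against NLTK's built-in implementation, which uses the same unit costs for insertion, deletion, and substitution:

import nltk

for s1, s2 in [("kitten", "sitting"), ("flaw", "lawn"), ("intention", "execution")]:
    print(f"{s1} -> {s2}: {nltk.edit_distance(s1, s2)}")
# Expected: 3, 2, 5 (matching the tables above)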


Q4. Write a program to implement top-down and bottom-up parsers using an appropriate context-free grammar.

Program:

class Parser:
    def __init__(self, grammar):
        self.grammar = grammar
        self.rules = self._build_rules(grammar)

    def _build_rules(self, grammar):
        rules = {}
        for lhs, rhs in grammar:
            if lhs not in rules:
                rules[lhs] = []
            rules[lhs].append(rhs)
        return rules

    def top_down_parse(self, sentence):
        # Start parsing from 'S' (start symbol)
        return self._top_down_helper('S', sentence, 0, [])

    def _top_down_helper(self, non_terminal, sentence, index, tree):
        if index == len(sentence):
            return None  # End of sentence, no more matching

        if non_terminal in self.rules:
            for production in self.rules[non_terminal]:
                new_index = index
                new_tree = tree.copy()
                for symbol in production:
                    if symbol in self.rules:  # It's a non-terminal
                        result = self._top_down_helper(symbol, sentence, new_index, new_tree)
                        if result is None:
                            break
                        else:
                            new_index = result[0]
                            new_tree = result[1]
                    elif new_index < len(sentence) and sentence[new_index] == symbol:  # It's a terminal
                        new_tree.append((symbol, sentence[new_index]))
                        new_index += 1
                    else:
                        break

                if new_index == len(sentence) and len(new_tree) > 0:
                    return (new_index, new_tree)
        return None

    def bottom_up_parse(self, sentence):
        # Start from the words in the sentence and reduce upwards
        return self._bottom_up_helper(sentence, [])

    def _bottom_up_helper(self, sentence, stack):
        if len(sentence) == 0:
            return stack if stack else None

        word = sentence[0]
        sentence = sentence[1:]

        # Try to reduce the current word to a non-terminal symbol
        for lhs, rhs_list in self.rules.items():
            for rhs in rhs_list:
                if rhs == [word]:
                    stack.append(lhs)
                    break

        # Now try to apply reductions on the stack
        for lhs, rhs_list in self.rules.items():
            for rhs in rhs_list:
                if len(stack) >= len(rhs) and stack[-len(rhs):] == rhs:
                    stack = stack[:-len(rhs)] + [lhs]

        return self._bottom_up_helper(sentence, stack)

# Define a simple grammar
grammar = [
    ('S', ['NP', 'VP']),
    ('NP', ['Det', 'N']),
    ('VP', ['V', 'NP']),
    ('VP', ['V']),
    ('Det', ['a']),
    ('Det', ['the']),
    ('N', ['cat']),
    ('N', ['dog']),
    ('V', ['chases']),
    ('V', ['eats']),
]

# Initialize the parser with the grammar
parser = Parser(grammar)

# Sample sentence
sentence = ['the', 'cat', 'chases', 'a', 'dog']

# Parse the sentence using the top-down parser
top_down_result = parser.top_down_parse(sentence)
print("Top-Down Parse Result:", top_down_result)

# Parse the sentence using the bottom-up parser
bottom_up_result = parser.bottom_up_parse(sentence)
print("Bottom-Up Parse Result:", bottom_up_result)

Output:

Top-Down Parse Result: None
Bottom-Up Parse Result: ['NP', 'VP', 'NP']

Note: these results expose the limits of the naive implementations above. The top-down helper succeeds only when a production consumes the entire remaining sentence at every recursion level, so the nested NP can never match and the parser returns None. The bottom-up helper makes a single pass over the rules after each word, so a reduction produced late in a pass (V to VP) can no longer feed an earlier rule (NP VP to S), leaving the stack as ['NP', 'VP', 'NP'].
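
For comparison, NLTK ships ready-made top-down (recursive-descent) and bottom-up (shift-reduce) parsers. A minimal sketch with the same grammar (assuming nltk is installed):

import nltk

cfg = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP | V
Det -> 'a' | 'the'
N -> 'cat' | 'dog'
V -> 'chases' | 'eats'
""")
tokens = ['the', 'cat', 'chases', 'a', 'dog']

# Top-down parsing with backtracking
rd_parser = nltk.RecursiveDescentParser(cfg)
for tree in rd_parser.parse(tokens):
    print(tree)  # (S (NP (Det the) (N cat)) (VP (V chases) (NP (Det a) (N dog))))

# Bottom-up (shift-reduce) parsing; note that the greedy strategy may reduce
# V to VP too early under this grammar and find no parse, mirroring the
# limitation of the hand-written bottom-up parser above.
sr_parser = nltk.ShiftReduceParser(cfg)
for tree in sr_parser.parse(tokens):
    print(tree)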


Q5. Given the following short movie reviews, each labeled with a genre, either comedy or action:
● fun, couple, love, love (comedy)
● fast, furious, shoot (action)
● couple, fly, fast, fun, fun (comedy)
● furious, shoot, shoot, fun (action)
● fly, fast, shoot, love (action)

and a new document D: fast, couple, shoot, fly

Compute the most likely class for D. Assume a Naive Bayes classifier and use add-1 smoothing for the likelihoods.

Program

from collections import defaultdict

# Define the dataset (reviews)
reviews = [
    ("fun, couple, love, love", "comedy"),
    ("fast, furious, shoot", "action"),
    ("couple, fly, fast, fun", "comedy"),
    ("furious, shoot, shoot, fun", "action"),
    ("fly, fast, shoot, love", "action")
]

# Preprocess the reviews into word counts for each class
class_word_counts = defaultdict(lambda: defaultdict(int))
class_counts = defaultdict(int)
vocab = set()

# Step 1: Preprocess reviews and update word counts
for review, label in reviews:
    words = review.split(", ")
    class_counts[label] += 1
    for word in words:
        class_word_counts[label][word] += 1
        vocab.add(word)

# Step 2: Vocabulary size
V = len(vocab)

# Step 3: Calculate prior probabilities (P(comedy) and P(action))
prior_comedy = class_counts["comedy"] / sum(class_counts.values())
prior_action = class_counts["action"] / sum(class_counts.values())

# Step 4: Apply add-1 smoothing to calculate the likelihoods for each word
def calc_likelihoods(class_word_counts, class_label):
    total_words = sum(class_word_counts[class_label].values())
    likelihoods = {}
    for word in vocab:
        # Add-1 smoothing
        likelihoods[word] = (class_word_counts[class_label].get(word, 0) + 1) / (total_words + V)
    return likelihoods

likelihood_comedy = calc_likelihoods(class_word_counts, "comedy")
likelihood_action = calc_likelihoods(class_word_counts, "action")

# Step 5: Classify the new document D using Naive Bayes
new_document = "fast, couple, shoot, fly"
new_words = new_document.split(", ")

# Calculate posterior probabilities for both classes
posterior_comedy = prior_comedy
posterior_action = prior_action

for word in new_words:
    posterior_comedy *= likelihood_comedy[word]
    posterior_action *= likelihood_action[word]

# Step 6: Print results and classify the document
print(f"Posterior for comedy: {posterior_comedy}")
print(f"Posterior for action: {posterior_action}")

if posterior_comedy > posterior_action:
    print("\nThe most likely class for the document is: Comedy")
else:
    print("\nThe most likely class for the document is: Action")

Output:
Posterior for comedy: 9.481481481481481e-05
Posterior for action: 0.00017146776406035664

The most likely class for the document is: Action
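
Multiplying many small likelihoods underflows quickly on longer documents, so Naive Bayes is usually computed in log space, where the argmax is unchanged. A minimal sketch reusing the quantities from Program 5:

import math

# Compare classes via log P(c) + sum of log P(w|c)
log_comedy = math.log(prior_comedy) + sum(math.log(likelihood_comedy[w]) for w in new_words)
log_action = math.log(prior_action) + sum(math.log(likelihood_action[w]) for w in new_words)
print("Log-posterior for comedy:", log_comedy)
print("Log-posterior for action:", log_action)
print("Most likely class:", "Comedy" if log_comedy > log_action else "Action")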


Q6. Demonstrate the following using an appropriate programming tool to illustrate the use of information retrieval in NLP:
● Study the various corpora (Brown, Inaugural, Reuters, udhr) with methods like fileids, raw, words, sents, categories
● Create and use your own corpora (plaintext, categorical)
● Study conditional frequency distributions
● Study tagged corpora with methods like tagged_sents, tagged_words
● Write a program to find the most frequent noun tags
● Map words to properties using Python dictionaries
● Study the rule-based tagger and the unigram tagger
● Find the words in a given plain text written without spaces by comparing it against a given corpus of words, and report a score.

Program

Open a Python shell or a Jupyter notebook and run this code to download the required corpora:

import nltk
nltk.download('brown')
nltk.download('inaugural')
nltk.download('reuters')
nltk.download('udhr')
nltk.download('stopwords')

import nltk
from nltk.corpus import brown, inaugural, reuters, udhr
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk import pos_tag, word_tokenize
from nltk.tag import UnigramTagger, DefaultTagger
from nltk.corpus import stopwords
import string

# 1. Explore Different Corpora

# Accessing Brown corpus
print("Brown Corpus Categories:", brown.categories())
print("Brown Corpus Words:", brown.words(categories="news")[:10])     # First 10 words in 'news' category
print("Brown Corpus Sentences:", brown.sents(categories="news")[:2])  # First 2 sentences in 'news' category

# Accessing Inaugural corpus
print("\nInaugural Corpus Raw Text:", inaugural.raw('1789-Washington.txt')[:100])

# Accessing Reuters corpus
print("\nReuters Corpus Categories:", reuters.categories())
print("Reuters Corpus Words:", reuters.words(categories="trade")[:10])

# Accessing UDHR corpus (Universal Declaration of Human Rights)
# Note: udhr is a plaintext corpus, so its languages are listed via fileids(),
# and each fileid carries an encoding suffix (e.g., 'English-Latin1').
print("\nUDHR Corpus Languages:", udhr.fileids()[:10])
print("UDHR English Words:", udhr.words('English-Latin1')[:10])

# 2. Create and Use Custom Corpora

# Create a simple plaintext corpus
plaintext_corpus = ["This is a simple corpus", "It contains simple sentences"]
nltk.data.path.append('corpus')
with open('my_corpus.txt', 'w') as file:
    for line in plaintext_corpus:
        file.write(line + '\n')

# Load and process the custom corpus
with open('my_corpus.txt', 'r') as file:
    custom_corpus = file.read()
print("\nCustom Corpus Words:", word_tokenize(custom_corpus)[:5])

# 3. Conditional Frequency Distributions

# Calculate the frequency distribution of words in the Brown corpus based on categories
cfd = ConditionalFreqDist(
    (category, word)
    for category in brown.categories()
    for word in brown.words(categories=category)
)
print("\nConditional Frequency Distribution Example (Category 'news'):", cfd["news"].most_common(5))

# 4. Study Tagged Corpora

# Use the Brown corpus tagged with parts of speech
tagged_sents = brown.tagged_sents(categories='news')
print("\nTagged Sentences from Brown Corpus:", tagged_sents[:2])

# Use tagged words from the Brown corpus
tagged_words = brown.tagged_words(categories='news')
print("\nTagged Words from Brown Corpus:", tagged_words[:10])

# 5. Find Most Frequent Noun Tags in Tagged Sentences

# Find most frequent noun tags (NN, NNS, etc.) in the tagged corpus
noun_tags = [tag for word, tag in tagged_words if tag.startswith('NN')]
freq_dist_nouns = FreqDist(noun_tags)
print("\nMost Frequent Noun Tags:", freq_dist_nouns.most_common(5))

# 6. Map Words to Properties Using Python Dictionaries

# Example: Map words to their frequencies in a given corpus
word_map = {}
for word in brown.words():
    word_map[word] = word_map.get(word, 0) + 1
print("\nWord Frequency Map (first 5 words):", dict(list(word_map.items())[:5]))

# 7. Rule-Based Tagger and Unigram Tagger

# Define a simple rule-based tagger (using regular expressions)
patterns = [
    (r'.*ed$', 'VBD'),      # Verbs ending in 'ed'
    (r'.*ing$', 'VBG'),     # Verbs ending in 'ing'
    (r'\b\w{3,}\b', 'NN'),  # Fallback: treat longer words as nouns
]
rule_based_tagger = nltk.RegexpTagger(patterns)

# Unigram Tagger (trained on tagged data)
unigram_tagger = UnigramTagger(tagged_sents)

# Example tagging with both taggers
print("\nRule-Based Tagger Example:", rule_based_tagger.tag(word_tokenize("I am playing football")))
print("Unigram Tagger Example:", unigram_tagger.tag(word_tokenize("I am playing football")))

# 8. Text Comparison (find words in a given plain text written without spaces)
# Define corpus to compare against
word_corpus = set(brown.words())

# Given plain text without spaces
input_text = "fastcoupleshootfly"

# Function to break the input text into valid words and score them
def find_words_from_text(text, corpus):
    words = []
    score = 0
    i = 0
    while i < len(text):
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in corpus:
                words.append(text[i:j])
                score += 1
                i = j
                break
        else:
            i += 1
    return words, score

# Find words and score
words, score = find_words_from_text(input_text, word_corpus)
print("\nWords found:", words)
print("Score:", score)

Output:

Brown Corpus Categories: ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
Brown Corpus Words: ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
Brown Corpus Sentences: [['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.']]

Inaugural Corpus Raw Text: Fellow-Citizens of the Senate and of the House of Representatives:

Among the vicissitudes incident

Reuters Corpus Categories: ['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
Reuters Corpus Words: ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN']
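
The conditional frequency distribution built in step 3 of the program also supports cross-tabulation over conditions; a short illustrative call (the sample words are chosen arbitrarily):

# Tabulate counts of a few modal verbs across two Brown categories
cfd.tabulate(conditions=['news', 'religion'],
             samples=['can', 'could', 'may', 'might', 'must', 'will'])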

Q7. Write a Python program to find synonyms and antonyms of the word "active" using WordNet.

Program

import nltk
from nltk.corpus import wordnet

# Ensure that WordNet data is downloaded
nltk.download('wordnet')
nltk.download('omw-1.4')

# Function to find synonyms and antonyms
def find_synonyms_antonyms(word):
    # Synonyms set
    synonyms = set()
    # Antonyms set
    antonyms = set()

    # Get synsets for the word
    for syn in wordnet.synsets(word):
        # Add synonyms (lemmas) to the synonyms set
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
            # Check for antonyms and add them
            if lemma.antonyms():
                antonyms.add(lemma.antonyms()[0].name())

    return list(synonyms), list(antonyms)

# Find synonyms and antonyms for the word 'active'
synonyms, antonyms = find_synonyms_antonyms('active')

print("Synonyms of 'active':", synonyms)
print("Antonyms of 'active':", antonyms)

Output:
Synonyms of 'active': ['alive', 'participating', 'combat-ready', 'dynamic', 'active', 'active_agent', 'fighting', 'active_voice']
Antonyms of 'active': ['inactive', 'extinct', 'passive', 'passive_voice', 'dormant', 'quiet', 'stative']
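
Beyond synonyms and antonyms, WordNet can also quantify how semantically close two senses are; a minimal sketch using Wu-Palmer similarity (the word pair is chosen purely for illustration):

from nltk.corpus import wordnet

# Compare the first senses of two words; wup_similarity may return None
# for sense pairs without a common hypernym (e.g., some adjective senses)
active = wordnet.synsets('active')[0]
dynamic = wordnet.synsets('dynamic')[0]
print("Wu-Palmer similarity:", active.wup_similarity(dynamic))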

Q8. Implement the machine translation application of NLP, where a machine translation model must be trained for a language with limited parallel corpora. Investigate and incorporate techniques to improve performance in low-resource scenarios.

Program
from transformers import MarianMTModel, MarianTokenizer
from datasets import load_dataset

# Load the pre-trained MarianMT model and tokenizer
model_name = 'Helsinki-NLP/opus-mt-en-de'  # Example for English to German
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Load the dataset (replace with your parallel corpus)
dataset = load_dataset('your_parallel_dataset')

# Tokenize the data
# NOTE: for sequence-to-sequence training, the target sentences must also be
# tokenized and stored under 'labels' (e.g., via tokenizer(text_target=...));
# the field names here depend on your dataset.
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length')

train_dataset = dataset['train'].map(tokenize_function, batched=True)

# Fine-tune the model on the low-resource corpus
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",  # requires an eval_dataset to be supplied
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()
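
After fine-tuning, the model can be used for inference with the standard MarianMT generate/decode calls. A widely used technique for low-resource pairs is back-translation: a reverse-direction model translates monolingual target-language text back into the source language, and the resulting synthetic pairs augment the scarce parallel corpus. A minimal sketch (the reverse-model name is an assumption matching the en-de example above):

# Inference with the (fine-tuned) model
inputs = tokenizer(["Natural language processing is fascinating."],
                   return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

# Back-translation sketch: load a reverse-direction model (de -> en) and use it
# to turn monolingual German text into synthetic English sources; the resulting
# (synthetic English, real German) pairs are added to the training set.
reverse_name = 'Helsinki-NLP/opus-mt-de-en'
reverse_model = MarianMTModel.from_pretrained(reverse_name)
reverse_tokenizer = MarianTokenizer.from_pretrained(reverse_name)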
