
Natural Language Processing

Unit I
🧠 Natural Language Processing (NLP)

📌 Definition:

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that enables computers to understand, interpret, and generate human language.

It combines:

 Linguistics

 Computer Science

 Machine Learning

🌍 Scope of NLP

NLP allows machines to interact with human language for a wide variety of
tasks:

Core Areas in NLP:

Area | Description
Text Classification | Categorizing texts (e.g., spam vs. non-spam)
Sentiment Analysis | Identifying emotions in text (positive/negative/neutral)
Machine Translation | Translating text from one language to another
Speech Recognition | Converting spoken language to text
Text Summarization | Creating concise summaries from large text
Question Answering | Systems that can answer user queries (like ChatGPT)
Named Entity Recognition (NER) | Identifying names, places, organizations in text

💡 Applications in Various Domains

1. 📱 Customer Support

 Chatbots and virtual assistants (e.g., Alexa, Siri)

 Automated replies and email sorting

2. 📈 Business Intelligence

 Analyzing customer feedback, reviews, and surveys

 Market sentiment analysis

3. 🏥 Healthcare

 Analyzing clinical notes and health records

 Extracting drug names, diagnoses, and treatments

4. 📰 Media and Journalism

 Auto-generating news summaries

 Detecting fake news

5. ⚖️Legal and Compliance

 Parsing legal documents

 Contract analysis

6. 🛒 E-Commerce

 Product recommendations based on reviews

 Search query understanding and auto-completion

7. 🎓 Education

 Automated essay scoring

 Language learning apps (e.g., Duolingo)


⚠️Challenges and Limitations of NLP

Despite its success, NLP faces several challenges:

1. Ambiguity of Language

 Words have multiple meanings based on context (e.g., “bank” – riverbank or financial institution?).

2. Sarcasm & Irony

 Machines struggle to detect sarcasm or emotional undertones.

3. Language Diversity

 Thousands of languages, dialects, and scripts make universal NLP hard.

4. Data Dependency

 Requires huge labeled datasets for training models.

 Data scarcity for low-resource languages.

5. Grammar and Syntax Complexity

 Natural languages have inconsistent grammar rules.

 Informal language, typos, slang complicate processing.

6. Context Understanding

 Hard to understand context across multiple sentences or long documents.

7. Bias and Fairness

 NLP models can learn and reflect biases present in training data
(gender, racial, political, etc.).

🚧 Limitations of Current NLP Systems:

 Often lack true understanding – they work statistically, not semantically.

 Struggle with reasoning, common sense, and logic-based tasks.


 Require retraining for domain-specific vocabulary (e.g., medical vs.
legal terms).

1. Syntax in NLP

Syntax refers to the structure of language — how words are combined to form grammatically correct sentences. It's about rules and grammar.

🔧 NLP Tasks in Syntax:

Task | Description
Part-of-Speech (POS) Tagging | Assigns word types like noun, verb, adjective, etc. to each word in a sentence.
Parsing (Syntactic Analysis) | Analyzes the grammatical structure of a sentence and builds a parse tree.
Chunking (Shallow Parsing) | Groups words into meaningful phrases (noun phrases, verb phrases, etc.).
Sentence Segmentation | Splits a paragraph into individual sentences.
Morphological Analysis | Breaks words into morphemes (smallest units of meaning). Example: “unhappiness” = un + happy + ness

✅ Example:

Sentence: "The cat sat on the mat."

 POS Tags: [The/DET] [cat/NOUN] [sat/VERB] [on/PREP] [the/DET] [mat/NOUN]

 Parse Tree: Shows how sentence components are nested.

🧠 2. Semantics in NLP

Semantics is the study of meaning in language — understanding what the sentence actually conveys.

🔧 NLP Tasks in Semantics:

Task | Description
Word Sense Disambiguation (WSD) | Identifies the correct meaning of a word in context (e.g., "bank" as riverbank or financial institution?).
Named Entity Recognition (NER) | Identifies entities like names, places, organizations, etc.
Semantic Role Labeling (SRL) | Determines the roles of words in a sentence (who did what to whom).
Textual Entailment | Checks if one sentence logically follows from another.
Semantic Similarity | Measures how close the meaning of two sentences or words is.

✅ Example:

Sentence: "Apple launched a new iPhone."

 NER: Apple → Organization, iPhone → Product

 SRL: Agent (Apple), Action (launched), Object (iPhone)
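
A minimal spaCy sketch of the NER part of this example (assuming spaCy and its small English model en_core_web_sm are installed; the exact labels depend on the model):

import spacy

nlp = spacy.load("en_core_web_sm")          # small English pipeline
doc = nlp("Apple launched a new iPhone.")

for ent in doc.ents:
    print(ent.text, ent.label_)             # e.g., Apple -> ORG

for token in doc:
    if token.dep_ in ("nsubj", "dobj"):     # rough agent/object cues from the dependency parse
        print(token.text, token.dep_)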

🧠 3. Pragmatics in NLP

Pragmatics focuses on how context and world knowledge affect language interpretation. It goes beyond literal meaning.

🔧 NLP Tasks in Pragmatics:

Task | Description
Coreference Resolution | Finds what a pronoun or noun phrase refers to (e.g., "John said he will come" → "he" = John).
Discourse Analysis | Understands the flow of meaning across multiple sentences or paragraphs.
Dialogue Management | Handles conversations in chatbots or virtual assistants by keeping track of context.
Intent Detection | Infers what the user wants (e.g., asking for weather, booking a flight).
Sarcasm & Irony Detection | Identifies non-literal or humorous use of language.

✅ Example:

Conversation:

 User: "I’m freezing!"

 Bot: (Understands the context means the user is cold and suggests
turning on the heater)

🎯 Summary Table

Level | Focus | Key NLP Tasks
Syntax | Structure | POS Tagging, Parsing, Chunking
Semantics | Meaning | WSD, NER, SRL, Entailment
Pragmatics | Context & Intent | Coreference, Discourse, Dialogue, Intent Detection

📘 1. Boolean Model

✅ Definition:

The Boolean model represents documents and queries as sets of terms (keywords) and uses Boolean logic (AND, OR, NOT) to retrieve documents that exactly match the query conditions.

📚 How it Works:

 Each document is represented as a binary vector of terms.

 A term is either present (1) or absent (0).

 Queries are written as Boolean expressions.

🔍 Example:

Query: "AI AND Healthcare NOT Robotics"

Only documents that contain both “AI” and “Healthcare” but not “Robotics”
will be returned.
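
A minimal sketch of this query in plain Python, using sets as the binary term vectors (the three documents are made up for illustration):

docs = {
    "d1": {"ai", "healthcare", "diagnosis"},
    "d2": {"ai", "robotics", "vision"},
    "d3": {"healthcare", "policy"},
}

# Query: AI AND Healthcare NOT Robotics
hits = [d for d, terms in docs.items()
        if "ai" in terms and "healthcare" in terms and "robotics" not in terms]
print(hits)  # ['d1']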
🧠 Advantages:

 Simple and easy to implement.

 Results are unambiguous.

⚠️Limitations:

 No ranking of documents (all matches are equal).

 Does not handle partial relevance.

 No concept of term frequency or document similarity.

📘 2. Vector Space Model (VSM)

✅ Definition:

The Vector Space Model represents documents and queries as vectors in an n-dimensional space, where each dimension corresponds to a term. Relevance is measured using cosine similarity between vectors.

📚 How it Works:

 Documents and queries are represented as TF-IDF vectors.

 Cosine similarity is used to calculate the angle between query and document vectors.

 The smaller the angle, the more relevant the document.

📈 Formula:

cosine_similarity(d, q) = (d · q) / (‖d‖ · ‖q‖)

Where:

 d = Document vector

 q = Query vector

🔍 Example:

If a query is "Machine Learning" and Document A has those terms with high
weights, it will rank higher than Document B which only mentions "Learning".
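
A minimal ranking sketch, assuming scikit-learn is available (the two documents are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["machine learning improves search",       # Document A
        "students enjoy learning languages"]       # Document B

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)                # TF-IDF document vectors
query_vec = vectorizer.transform(["machine learning"])   # query in the same space

print(cosine_similarity(query_vec, doc_vecs)[0])  # Document A scores higher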

🧠 Advantages:
 Supports ranking of documents.

 Measures partial matches.

 Easy to implement using linear algebra.

⚠️Limitations:

 Ignores term dependencies (word order).

 Can't handle uncertainty or noise in meaning.

 Performance drops in very large corpora without optimization.

📘 3. Probabilistic Model

✅ Definition:

The Probabilistic Model estimates the probability that a document is relevant to a given query. It ranks documents based on this probability.

📚 How it Works:

 Based on Bayes' Theorem.

 For each document D, calculate P(R = 1 | D, Q), the probability that D is relevant to query Q.

 The system ranks documents by decreasing order of probability.

🔍 Example Models:

 Binary Independence Model (BIM)

 BM25 (Best Matching 25) – one of the most popular probabilistic IR models.
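
A minimal BM25 scoring sketch (purely illustrative: the corpus is made up and k1/b are typical default-style hyperparameters, not tuned values):

import math

corpus = [["ai", "in", "healthcare"],
          ["robotics", "and", "ai"],
          ["healthcare", "policy", "reform"]]
k1, b = 1.5, 0.75
N = len(corpus)
avgdl = sum(len(d) for d in corpus) / N

def idf(term):
    n_t = sum(1 for d in corpus if term in d)            # document frequency
    return math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)

def bm25(query, doc):
    score = 0.0
    for term in query:
        f = doc.count(term)                              # term frequency in this document
        score += idf(term) * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

for i, doc in enumerate(corpus):
    print(i, round(bm25(["ai", "healthcare"], doc), 3))  # ranks document 0 highest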

🧠 Advantages:

 Provides statistical foundation.

 Can be tuned using relevance feedback.

 More flexible and accurate than Boolean or basic Vector models.

⚠️Limitations:

 Requires training data or assumptions about relevance.


 Complex to compute in large-scale environments.

📊 Comparison Table

Feature | Boolean Model | Vector Model | Probabilistic Model
Basis | Set theory | Linear algebra | Probability theory
Result Type | Binary (match/no match) | Ranked documents | Ranked documents
Term Weighting | None (0 or 1) | Yes (TF-IDF) | Yes (based on relevance)
Handles Partial Match | No | Yes | Yes
Complexity | Low | Moderate | High
Common Use | Simple search filters | Full-text search engines | Advanced IR systems (e.g., BM25)

✅ Summary

 Boolean Model – Simple, rule-based, good for exact match.

 Vector Model – Ranks documents based on similarity, suitable for search engines.

 Probabilistic Model – Advanced, statistical model used in modern IR systems like Elasticsearch and Lucene.

🧠 1. Rule-Based Model (General NLP)

✅ Overview:

 Uses manually crafted linguistic rules (grammar, syntax, morphology).

 Based on expert knowledge and dictionaries.

 No learning from data.

🧩 Example Use:
 POS tagging using handcrafted rules (e.g., "if a word ends with 'ly', tag
it as an adverb").

🧠 Pros:

 Interpretable, deterministic.

 Useful in domain-specific or low-resource languages.

⚠️Cons:

 Time-consuming to develop.

 Poor generalization to unseen data.

 Difficult to scale for real-world language complexities.

📈 2. Statistical Model

✅ Overview:

 Uses probabilities derived from large corpora.

 Learns language patterns using machine learning.

 Examples: Hidden Markov Models (HMM), n-grams, Maximum Entropy Models.

🧩 Example Use:

 POS tagging using HMMs

 Speech recognition using n-gram language models

🧠 Pros:

 Better adaptability to unseen data.

 More accurate than rule-based for many tasks.

⚠️Cons:

 Requires large datasets.

 May produce unnatural output in some cases.

🔍 3. Information Retrieval (IR) Model


✅ Overview:

 Focuses on retrieving relevant documents given a user query.

 Based on models like Boolean, Vector Space, and Probabilistic (BM25).

🧩 Example Use:

 Search engines, document ranking, and question answering systems.

🧠 Pros:

 Efficient retrieval over large text corpora.

 Supports ranking of relevance.

⚠️Cons:

 Doesn't understand deep semantics.

 Focuses more on matching than true understanding.

🌍 4. Rule-Based Machine Translation (RBMT)

✅ Overview:

 Translates text using predefined linguistic rules and bilingual dictionaries.

 Language pairs are mapped using transfer rules.

🧩 Example Use:

 Translating between grammatically similar languages using syntactic rules.

🧠 Pros:

 High accuracy in limited domain or formal language.

 Human-readable transformation rules.

⚠️Cons:

 Extremely resource-intensive to build.

 Fails with idiomatic or ambiguous language.

 Not robust to informal or spoken input.


📊 5. Probabilistic Graphical Models (PGM)

✅ Overview:

 Combines graph theory and probability to model complex dependencies.

 Examples: Bayesian Networks, Markov Random Fields, Conditional Random Fields (CRF).

🧩 Example Use:

 NER, POS tagging, sequence labeling using CRFs.

🧠 Pros:

 Captures dependencies and uncertainty in language.

 Well-suited for structured prediction tasks.

⚠️Cons:

 Computationally intensive.

 Requires labelled data and expertise in probabilistic modeling.

📊 Summary Comparison Table

Feature / Model | Rule-Based NLP | Statistical NLP | IR Model | Rule-Based MT | Probabilistic Graphical Model
Based on | Human rules | Data/statistics | Term matching | Translation rules | Graph + probability
Learning from data? | ❌ No | ✅ Yes | ✅/❌ (depends) | ❌ No | ✅ Yes
Flexibility | Low | Moderate to high | High for retrieval | Low | High
Interpretability | High | Moderate | High | High | Moderate
Scalability | Low | High | High | Low | Moderate
Example Task | Grammar check | POS tagging, MT | Search engine | Language translation | NER, POS, info extraction

✅ Final Notes:

 Modern NLP systems (like BERT, GPT) use deep learning, which has largely replaced these classical models, though PGMs and statistical models are still used in low-resource and explainable NLP systems.

 Rule-based and IR models are still used in hybrid systems where interpretability is critical (e.g., legal, medical).

Unit II: Linguistics and Morphology
1. Phonetics

✅ Definition:

Phonetics is the study of the physical sounds of human speech — how speech sounds are produced, transmitted, and received.

🔍 Key Aspects of Phonetics:

Type of Phonetics | Description
Articulatory Phonetics | Studies how speech sounds are produced using vocal organs (e.g., lips, tongue, vocal cords).
Acoustic Phonetics | Studies the physical properties of sound waves (frequency, amplitude, duration).
Auditory Phonetics | Focuses on how listeners perceive speech sounds.

🧩 Applications in NLP:
 Speech Recognition Systems: Understanding sound variations
(accents, pronunciation).

 Text-to-Speech (TTS): Generating natural-sounding speech.

 Voice Biometrics: Identifying users based on vocal characteristics.

Example:

The word “cat” is made up of these phonetic sounds:

 /k/ – voiceless velar stop

 /æ/ – low front vowel

 /t/ – voiceless alveolar stop

🧠 2. Phonology

✅ Definition:

Phonology is the study of how sounds function in a particular language or languages — the abstract, mental representation of sounds.

🔍 Key Concepts:

Concept | Explanation
Phoneme | The smallest sound unit that can change meaning (e.g., /p/ vs. /b/ in pat vs. bat).
Allophone | Variations of a phoneme that do not change meaning (e.g., aspirated /pʰ/ in pin vs. unaspirated /p/ in spin).
Minimal Pairs | Pairs of words that differ by only one phoneme (e.g., bit vs. pit).
Syllable Structure | Rules on how sounds are organized into syllables (onset, nucleus, coda).

🧩 Applications in NLP:

 Pronunciation modeling in speech synthesis.

 Language modeling for ASR (Automatic Speech Recognition).

 Phonological rules used in grammar correction and spelling prediction.
Example:

In English, the plural ending "-s" is pronounced differently based on the final
sound of the noun:

 cats → /s/

 dogs → /z/

 buses → /ɪz/

This variation is phonological and follows specific sound rules.

🔄 Phonetics vs. Phonology — Comparison

Feature | Phonetics | Phonology
Focus | Physical properties of speech sounds | Abstract, mental representation of sounds
Scope | How sounds are made and heard | How sounds function and relate to each other
Units of Study | Phones (actual sounds) | Phonemes (meaningful sound units)
Tools | Spectrograms, waveforms, articulatory diagrams | Phonemic charts, minimal pairs
NLP Use Case | TTS, ASR | Pronunciation dictionaries, language rules

🎯 Summary

 Phonetics = the science of sounds (how they are produced, transmitted, and heard).

 Phonology = the rules and patterns of how sounds are organized in a language.

Both are crucial for speech-based NLP tasks and understanding the
linguistic backbone of natural language.

🧱 1. Morphology
✅ Definition:

Morphology is the study of the structure of words — how they are formed and how they relate to other words.

🔹 Key Term: Morpheme

 A morpheme is the smallest meaningful unit of language.

 Types:

o Free morpheme: Can stand alone (e.g., "book", "run").

o Bound morpheme: Cannot stand alone (e.g., "-s", "-ed", "un-").

🧩 Applications in NLP:

 Lemmatization & stemming

 Spell checking

 Machine translation

2. Syntax

✅ Definition:

Syntax studies the structure of sentences — how words are combined to form grammatically correct phrases and sentences.

🔹 Concepts:

 Phrases (noun phrase, verb phrase)

 Parse trees

 Part-of-Speech (POS) tagging

🧩 Applications in NLP:

 Grammar checking

 Sentence parsing

 Text generation

💬 3. Semantics
✅ Definition:

Semantics is the study of meaning in language — both of individual words and how meaning is constructed in phrases/sentences.

🔹 Examples:

 Word sense disambiguation (e.g., “bank” as a riverbank vs financial)

 Semantic roles (who did what to whom)

🧩 Applications in NLP:

 Question answering

 Chatbots

 Semantic search

🧠 4. Pragmatics

✅ Definition:

Pragmatics deals with language in context — how meaning is interpreted based on situation, speaker intent, and shared knowledge.

🔹 Examples:

 “Can you pass the salt?” → A request, not a question about ability.

🧩 Applications in NLP:

 Conversational AI

 Sentiment analysis

 Emotion detection

🧭 5. Semiotics

✅ Definition:

Semiotics is the study of signs and symbols and their use or interpretation in communication.

🔹 Aspects:

 Signifier (form) and Signified (meaning)


 Language as a system of signs

🧩 Applications:

 Symbolic reasoning

 Brand and visual language analysis

 Human-computer interaction

📚 6. Discourse Analysis

✅ Definition:

Discourse analysis studies language beyond the sentence level — including coherence, context, and structure in conversation or text.

🔹 Topics:

 Turn-taking in conversation

 Co-reference resolution (e.g., “John was tired. He went home.”)

 Discourse markers ("however", "therefore")

🧩 Applications in NLP:

 Dialogue systems

 Summarization

 Narrative analysis

🧬 7. Psycholinguistics

✅ Definition:

Psycholinguistics examines how language is processed in the human brain, including how people understand, produce, and acquire language.

🔹 Areas:

 Language acquisition

 Word recognition

 Sentence processing
🧩 Applications in NLP:

 Cognitive modeling

 Speech recognition and error prediction

 Assistive technologies

8. Corpus Linguistics

✅ Definition:

Corpus linguistics involves the study of language through large collections of real-world text data (corpora).

🔹 Components:

 Annotated corpora (POS tags, syntactic trees, semantic roles)

 Frequency analysis

 Concordance analysis
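
A minimal frequency-analysis sketch with NLTK's built-in Brown corpus (assumes nltk.download('brown') has been run; any tokenized corpus would work the same way):

from nltk import FreqDist
from nltk.corpus import brown

words = [w.lower() for w in brown.words(categories="news")]
fdist = FreqDist(words)
print(fdist.most_common(5))   # most frequent tokens in the news section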

🧩 Applications in NLP:

 Training language models

 POS tagging

 Collocation extraction

 Sentiment analysis

🧠 Summary Table:

Concept | Focus | NLP Use Cases
Morphology | Word structure & formation | Lemmatization, stemming, tokenization
Syntax | Sentence structure | Parsing, POS tagging, syntax trees
Semantics | Meaning of words & sentences | QA systems, WSD, knowledge graphs
Pragmatics | Contextual meaning | Chatbots, intent detection
Semiotics | Symbols and meaning | Symbolic AI, UX design
Discourse Analysis | Beyond sentence-level meaning | Summarization, co-reference resolution
Psycholinguistics | Brain's language processing | Cognitive NLP, speech prediction
Corpus Linguistics | Study of language via datasets | Model training, text mining, language resources

📚 1. Word Formation Processes

These are the different ways in which new words are formed in a language.
Understanding these processes helps in morphological analysis, tokenization,
and language generation.

🔹 Major Types:

Process | Description | Example
Derivation | Adding prefixes or suffixes to create new words | happy → unhappy, play → playful
Inflection | Modifying a word to express tense, number, etc., without changing its category | walk → walked, cat → cats
Compounding | Joining two words to form a new word | notebook, blackboard
Clipping | Shortening a word | telephone → phone, influenza → flu
Blending | Combining parts of two words | brunch (breakfast + lunch), smog (smoke + fog)
Acronyms & Initialisms | Forming words from initial letters | NASA, FBI
Conversion | Changing the grammatical category without modification | Google (noun) → to Google (verb)

🧩 2. Morphological Analysis

✅ Definition:

Morphological analysis is the process of breaking down a word into its base form (root) and its morphemes (smallest units of meaning).

🔍 Example:

 Word: "unhappiness"

o Root: happy

o Prefix: un- (negation)

o Suffix: -ness (noun-forming)

🧠 Uses in NLP:

 Lemmatization (finding base form)

 Stemming

 Machine Translation

 Spell Checkers

🔧 Tools:

 Rule-based analyzers

 Morphological dictionaries

 Finite State Transducers (see next section)

⚙️3. Morphological Finite State Transducers (FSTs)

✅ Definition:

A Finite State Transducer (FST) is an automaton (like a state machine) used for modeling the mapping between two sets of symbols. In morphology, it maps surface forms (inflected) to lexical forms (base + features).
🧠 Think of it like:

 Input: "running"

 Output: run + V + Prog

🔹 Structure:

 States (nodes)

 Transitions (edges with input/output)

 Accept states

 Alphabet (symbols or letters)

📜 How it works:

It reads a word character by character, matching rules (like suffix patterns) to derive its root and grammatical properties.

🔍 Example Transition:

State 0:
  "r" → "r" → State 1
  "u" → "u" → State 2
  ...
  "ing" → "+V+Prog" → Final state

💡 Applications:

 Morphological parsing (analyze form)

 Morphological generation (create word forms from root + grammar)

 Speech recognition and synthesis

 Language learning tools

🧪 Example Output:

Input: "walked"
Output: walk + V + Past
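
A toy suffix-rule analyzer in the same spirit (purely illustrative; it is not a real FST and applies no spelling rules, which is why real systems use tools like XFST or FOMA):

RULES = [
    ("ed",  "+V+Past"),
    ("ing", "+V+Prog"),
    ("s",   "+N+Pl"),
]

def analyze(word):
    for suffix, tag in RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + tag   # strip the suffix, emit the feature tag
    return word + "+Base"

print(analyze("walked"))    # walk+V+Past
print(analyze("running"))   # runn+V+Prog (naive stripping, no handling of doubled consonants)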

Implementation in NLP

Popular tools that use FSTs or support morphological analysis:

Tool | Description
XFST | Xerox Finite-State Tool for morphological parsing
FOMA | Open-source alternative to XFST
Hunspell | Used in spell checkers (LibreOffice, Firefox)
NLTK | Can be used to create simple morphological analyzers

🎯 Summary

Concept | Description
Word Formation Processes | How new words are created in language
Morphological Analysis | Identifying root words and affixes
Finite State Transducers | Automata used to model and process morphological rules

Unit III: Word Level Analysis
🧱 1. Tokenization
✅ Definition:

Tokenization is the process of splitting text into smaller units (called tokens), such as words, subwords, or sentences.

🔹 Types:

 Word Tokenization – Splits text into words.

o Example: "I love NLP" → ["I", "love", "NLP"]

 Sentence Tokenization – Splits a paragraph into sentences.

o Example: "I love NLP. It's amazing!" → ["I love NLP.", "It's
amazing!"]

 Subword Tokenization – Useful for deep learning models (e.g., BERT).

o Example: "unhappiness" → ["un", "happi", "ness"]

🧠 Why it's useful:

 First step in most NLP pipelines.

 Enables other processes like POS tagging, lemmatization, etc.
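
A minimal NLTK sketch of word and sentence tokenization (assumes nltk and its 'punkt' tokenizer data are installed):

# import nltk; nltk.download("punkt")   # first run only
from nltk.tokenize import sent_tokenize, word_tokenize

text = "I love NLP. It's amazing!"
print(sent_tokenize(text))   # ['I love NLP.', "It's amazing!"]
print(word_tokenize(text))   # ['I', 'love', 'NLP', '.', 'It', "'s", 'amazing', '!']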

🧩 2. Part-of-Speech Tagging (POS Tagging)

✅ Definition:

POS tagging is the process of labeling each word with its grammatical
category (e.g., noun, verb, adjective).

🔹 Example:

"She is eating an apple."


→ [('She', PRON), ('is', AUX), ('eating', VERB), ('an', DET), ('apple', NOUN)]
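
A minimal NLTK tagging sketch for the same sentence (assumes the tagger data has been downloaded; NLTK returns Penn Treebank tags like PRP/VBZ/VBG rather than the simplified labels above):

# import nltk; nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # first run
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("She is eating an apple.")))
# [('She', 'PRP'), ('is', 'VBZ'), ('eating', 'VBG'), ('an', 'DT'), ('apple', 'NN'), ('.', '.')]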

🔹 Common POS Tags:

Tag | Meaning
NN | Noun
VB | Verb
JJ | Adjective
RB | Adverb
PRP | Pronoun

Libraries:

 NLTK

 spaCy

 Stanford NLP

 TextBlob

🧠 Use Cases:

 Grammar checking

 Named Entity Recognition

 Machine Translation

🧬 3. Lemmatization

✅ Definition:

Lemmatization is the process of reducing a word to its base or dictionary form (lemma), considering the context and part of speech.

🔹 Example:

"am", "are", "is" → "be"


"better" → "good"
"running" → "run"

💡 Features:

 Considers word meaning and POS

 Uses lexicons and morphological analysis

Tools:

 WordNetLemmatizer (NLTK)
 spaCy's Token.lemma_
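
A minimal WordNetLemmatizer sketch (assumes the 'wordnet' data is installed; the pos argument tells the lemmatizer which part of speech to assume):

# import nltk; nltk.download("wordnet")   # first run only
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))   # run
print(lemmatizer.lemmatize("better", pos="a"))    # good
print(lemmatizer.lemmatize("are", pos="v"))       # be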

✂️4. Stemming

✅ Definition:

Stemming is the process of removing suffixes or prefixes to find the root form of a word.

🔹 Example:

"playing", "played", "plays" → "play"


"studies" → "studi"

💡 Characteristics:

 Often results in non-real words

 Uses heuristic-based rules

 Faster but less accurate than lemmatization

🔧 Common Algorithms:

 Porter Stemmer

 Lancaster Stemmer

 Snowball Stemmer
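
A minimal Porter stemmer sketch with NLTK, reproducing the examples above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "played", "plays", "studies"]:
    print(word, "->", stemmer.stem(word))   # play, play, play, studi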

🔍 Comparison Table:

Feature | Tokenization | POS Tagging | Lemmatization | Stemming
Purpose | Split text | Assign grammatical tags | Reduce to base word | Remove suffixes
Output | Tokens | (word, POS) pairs | Lemmas | Stems
Context aware? | ❌ | ✅ | ✅ | ❌
Real words? | ✅ | ✅ | ✅ | ❌ (sometimes)
Speed | Fast | Moderate | Moderate | Fast

1. Named Entity Recognition (NER)

✅ Definition:

NER is the task of identifying and classifying named entities in text into
predefined categories such as:

 Person names

 Organizations

 Locations

 Dates, Time

 Monetary values

 Percentages

🔍 Example:

"Barack Obama was born in Hawaii and served as President of the United
States."
NER Tags:

 Barack Obama → PERSON

 Hawaii → LOCATION

 President → TITLE

 United States → LOCATION/ORGANIZATION

🔧 NER Categories (Typical):

Entity Type | Example
PERSON | Elon Musk
LOCATION | India, Himalayas
ORGANIZATION | Google, UN
DATE | 15 August 1947
TIME | 10:30 AM
MONEY | $100, ₹500

Libraries:

 spaCy (ent.label_)

 Stanford NLP

 Flair

 Hugging Face transformers (e.g., BERT NER models)

🧠 Applications:

 Information extraction

 Question answering

 Resume parsing

 Chatbots

🧠 2. Word Sense Disambiguation (WSD)

✅ Definition:

WSD is the process of determining the correct meaning (sense) of a word based on its context, especially for words with multiple meanings.

🔍 Example:

Word: "bank"

 "He went to the bank to deposit money." → financial institution

 "He sat on the bank of the river." → side of a river

🔧 Approaches:

Approach | Description
Knowledge-based | Uses dictionaries like WordNet
Supervised | Uses annotated corpora for training
Unsupervised | Clusters word usage without labeling
Contextual models | Uses transformers (e.g., BERT, GPT)

Tools:

 NLTK + WordNet

 Lesk Algorithm (classic)

 Deep Learning with Hugging Face models
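
A minimal sketch of the classic Lesk approach via NLTK (assumes the 'wordnet' and 'punkt' data are installed; the chosen sense depends on gloss overlap, so results can be imperfect):

from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

context = word_tokenize("He went to the bank to deposit money.")
sense = lesk(context, "bank", pos="n")      # pick the noun sense with the most gloss overlap
print(sense, "-", sense.definition() if sense else "no sense found")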

🧠 Applications:

 Machine translation

 Search engines

 Semantic analysis

 Question answering systems

🔡 3. Word Embedding

✅ Definition:

Word Embedding is a technique where words are represented as dense vectors in a continuous vector space, where similar words have similar representations.

🔍 Why?

It captures the semantic meaning of words based on context and co-occurrence in large corpora.

🧠 Key Ideas:

 Words used in similar contexts are closer in vector space.

 Unlike one-hot encoding, word embeddings are dense and low-dimensional.

🔧 Popular Word Embedding Models:

Model | Description
Word2Vec | Predicts a word from its context (or vice versa)
GloVe | Uses word co-occurrence matrix
FastText | Includes subword information
BERT Embeddings | Contextualized, dynamic embeddings from transformers

🌐 Example:

from gensim.models import Word2Vec

# Tiny toy corpus: each sentence is a list of tokens
sentences = [["dog", "barks"], ["cat", "meows"]]

# Train a small Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Get the embedding vector for a word
vector = model.wv["dog"]

🔍 Word2Vec Example Similarities:

 model.wv.most_similar("king") → might return ["queen", "prince", "monarch"]

🧠 Applications:

 Sentiment analysis

 Recommendation systems

 Chatbots

 Text classification
 Similarity detection

🔁 Summary Table:

Concept | Description | Key Use | Tool Examples
NER | Identify names, locations, etc. | Info extraction | spaCy, NLTK, Transformers
WSD | Identify correct meaning of a word in context | Semantic analysis | WordNet, BERT, Lesk
Word Embedding | Represent words as vectors | Word similarity, NLP modeling | Word2Vec, GloVe, BERT

1. ✅ Rule-Based POS Tagging

📌 Definition:

Uses hand-written rules and lexicons (dictionaries) to assign POS tags based on patterns in the words and their neighbors.

🔧 How it works:

 A dictionary provides possible tags for each word.

 Syntactic rules determine the most likely tag based on context.

o e.g., If a word is preceded by a determiner (like "the"), it's likely a noun.

🧪 Example:

“The boy played football.”

 Rule: If a word follows "The", tag it as a noun → "boy" = NOUN

✅ Pros:

 Simple, interpretable

 Doesn’t require training data

❌ Cons:

 Hard to scale (requires many rules)

 Less accurate than statistical methods


2. 🎲 Stochastic (Statistical) POS Tagging

📌 Definition:

Uses probability and statistics from annotated corpora to determine the most likely tag for a word.

🔧 Methods:

 Unigram Tagging: Assigns the most frequent tag for each word.

 Bigram/Trigram Tagging: Considers previous one or two tags as context.

 Hidden Markov Models (HMM): Uses probabilistic models for sequences of words and tags.

🧪 Example:

Word: “play”

 In “They play cricket” → verb

 In “a play by Shakespeare” → noun

 The tag with the highest probability is selected based on context.

✅ Pros:

 Learns from data

 Higher accuracy with more training

❌ Cons:

 Requires large annotated corpora

 Can struggle with unknown words

3. 🔁 Transformation-Based (Brill) POS Tagging

📌 Definition:

A hybrid approach that combines rule-based and statistical methods. It starts with initial tagging (usually unigram), then refines it using transformation rules learned from training data.
🧠 Example:

 Initial tagging: "can" → noun

 Transformation rule: If "can" is followed by a verb, it's likely a modal verb

🔧 Key Steps:

1. Initial tagging

2. Learn transformation rules from corpus

3. Apply rules iteratively to improve accuracy

✅ Pros:

 High accuracy

 Learns interpretable rules from data

❌ Cons:

 Slower to train

 Needs both annotated data and rule generation

4. 📚 Lexical POS Tagging

📌 Definition:

Uses a lexicon (dictionary) where each word is associated with its possible
POS tags based on usage frequency or pre-defined mapping.

🧠 Characteristics:

 Doesn't look much at surrounding words (unlike stochastic methods).

 Works well for words with unique or less ambiguous tags.

🧪 Example:

 "and" → Always tagged as CONJUNCTION

 "the" → Always tagged as DETERMINER

✅ Pros:

 Simple and fast


 Effective for known, unambiguous words

❌ Cons:

 Poor at resolving ambiguity

 Doesn’t handle context well

🧾 Summary Table:

Type | Method Used | Learns from Data? | Context Awareness | Accuracy
Rule-Based | Hand-coded rules | ❌ | ✅ (via rules) | Medium
Stochastic | Probabilities, HMM | ✅ | ✅ | High
Transformation-Based | Rules learned from data | ✅ | ✅ | High
Lexical | Word-tag dictionary | ❌ | ❌ | Low-Med

Tools & Libraries Supporting All:

 NLTK (Python)

 spaCy

 Stanford CoreNLP

 OpenNLP

 AllenNLP

1. Hidden Markov Model (HMM)

✅ Definition:

An HMM is a statistical model used for sequence prediction, where the


system being modeled is assumed to be a Markov process with hidden
states.

In NLP, it is widely used for:


 Part-of-Speech Tagging

 Named Entity Recognition

 Speech Recognition

🧠 Key Concepts:

 States: POS tags (e.g., Noun, Verb)

 Observations: Words in the sentence

 Transition probabilities: Probability of one tag following another

 Emission probabilities: Probability of a word given a tag

🔁 HMM Steps:

1. Start with a known POS tag.

2. Predict the next tag using transition probabilities.

3. Predict the word using emission probabilities.

4. Use Viterbi Algorithm to find the most likely tag sequence.
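
A toy Viterbi decoder over hand-set probabilities, to make the four steps concrete (all numbers are invented for illustration; a real tagger estimates them from an annotated corpus):

states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {"NOUN": {"dogs": 0.5, "bark": 0.1},
          "VERB": {"dogs": 0.1, "bark": 0.6}}

def viterbi(words):
    # V[i][s] = (best probability of ending in state s at word i, backpointer)
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 1e-6), None) for s in states}]
    for i in range(1, len(words)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[i-1][p][0] * trans_p[p][s] * emit_p[s].get(words[i], 1e-6), p)
                for p in states)
            V[i][s] = (prob, prev)
    # Trace back the most likely tag sequence
    last = max(V[-1], key=lambda s: V[-1][s][0])
    tags = [last]
    for i in range(len(words) - 1, 0, -1):
        last = V[i][last][1]
        tags.append(last)
    return list(reversed(tags))

print(viterbi(["dogs", "bark"]))  # ['NOUN', 'VERB']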

⚖️2. Maximum Entropy Model (MaxEnt)

✅ Definition:

A discriminative model that uses the principle of maximum entropy to predict the most likely outcome (e.g., POS tag, NER tag) based on contextual features.

🧠 Key Idea:

Rather than modeling sequences like HMMs, MaxEnt focuses on:

 Feature-based classification (e.g., word suffix, previous tags)

 Tries to be as unbiased as possible without contradicting known data

💡 Example Features for POS:

 Current word = "running"

 Previous word = "is"

 Word suffix = "ing"


📦 Used in:

 POS tagging

 NER

 Text classification

✅ Pros:

 Can use arbitrary, overlapping features

 More flexible than HMMs

📊 3. N-grams

✅ Definition:

N-grams are contiguous sequences of n items (words or characters) from a given sample of text.

N-Gram | Example (Sentence: "I love NLP")
Unigram (n=1) | I, love, NLP
Bigram (n=2) | I love, love NLP
Trigram (n=3) | I love NLP

📌 Applications:

 Language modeling

 Text generation

 Spelling correction

 Auto-suggestions

⚠️Limitation:

 Poor for capturing long-range dependencies

 Data sparsity in higher n-grams


🔗 4. Collocations

✅ Definition:

Collocations are combinations of words that occur together more often than by chance.

🔍 Examples:

 “strong tea” (but not “powerful tea”)

 “make a decision” (not “do a decision”)

📊 Detected Using:

 Frequency-based methods: Pointwise Mutual Information (PMI)

 Statistical association measures: Chi-squared, t-score
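
A minimal PMI-based sketch with NLTK (the word list is made up so that "strong tea" and "make a decision" co-occur repeatedly):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = ("strong tea is strong tea and strong tea again "
         "make a decision then make a decision").split()
finder = BigramCollocationFinder.from_words(words)
print(finder.nbest(BigramAssocMeasures.pmi, 3))   # top bigrams ranked by PMI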

📦 Applications:

 Machine translation

 Lexicography (dictionary building)

 Improving naturalness of generated text

5. Applications of Named Entity Recognition (NER)

📌 Real-World Use Cases:

Application Area | Description
Information Extraction | Extract people, organizations, locations from news articles
Question Answering | Identify named entities as answers
Resume Parsing | Extract names, dates, skills from CVs
Chatbots | Understand user intents better
Social Media Monitoring | Detect brands, products, or sentiment drivers
Legal & Medical Docs | Extract patient names, drug names, legal terms

🔧 Example:

Input:

“Apple Inc. launched a new iPhone in California on September 14.”

NER Tags:

 Apple Inc. → ORGANIZATION

 iPhone → PRODUCT

 California → LOCATION

 September 14 → DATE

🧾 Summary Table:

Concept | Description | Main Use
HMM | Probabilistic sequence model using hidden states | POS Tagging, NER
MaxEnt | Feature-based discriminative classifier | POS, NER, Sentiment
N-Grams | Sequence of N tokens | Language modeling
Collocations | Frequent co-occurring phrases | Natural phrasing
NER Applications | Use of entity extraction in real domains | Resume, QA, NLP

Unit IV: Syntax Analysis

1. Grammatical Formalisms
Grammatical formalisms are systems or methods used to describe the
syntactic structure of sentences in a language. They form the basis for
syntactic parsing in Natural Language Processing (NLP).

🧱 2. Context-Free Grammar (CFG)

✅ Definition:

A Context-Free Grammar (CFG) is a type of formal grammar that consists of a set of production rules used to generate strings (sentences) in a language.

CFGs are widely used in NLP to describe the syntax of natural languages in
a simplified, yet powerful way.

✨ Components of CFG:

A CFG is defined as a 4-tuple G = (N, Σ, P, S), where:

 N = Non-terminal symbols (e.g., S, NP, VP)

 Σ = Terminal symbols (e.g., actual words)

 P = Production rules (e.g., S → NP VP)

 S = Start symbol (usually S for sentence)

🔧 Example of a CFG:

S → NP VP
NP → Det N
VP → V NP
Det → "the" | "a"
N → "cat" | "dog"
V → "chased" | "saw"

🧪 Example Sentence:
"The cat chased the dog."

Parsed as:

 S

o NP (The cat)

o VP (chased the dog)
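
A minimal NLTK sketch that parses the example sentence with the toy grammar above (lower-cased so the tokens match the terminals):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'cat' | 'dog'
V -> 'chased' | 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat chased the dog".split()):
    tree.pretty_print()   # draws the constituency tree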

3. Grammar Rules for English

These are specific rules to describe how words combine to form phrases
and sentences in English.

🔹 Common English Grammar Rules (CFG-like):

Rule | Description
S → NP VP | A sentence consists of a noun phrase and a verb phrase
NP → Det N | A noun phrase is a determiner followed by a noun
VP → V NP | A verb phrase is a verb followed by a noun phrase
PP → P NP | A prepositional phrase is a preposition followed by a noun phrase
NP → NP PP | A noun phrase can be expanded with a prepositional phrase

📍 Example:

"The dog in the garden barked loudly."

This can be parsed using rules:

 NP → Det N PP

 PP → P NP

 VP → V Adv

🧠 4. Syntactic Parsing
✅ Definition:

Syntactic parsing (or syntactic analysis) is the process of analyzing the structure of a sentence according to a grammar.

It answers:

“How are the words in this sentence structured grammatically?”

📍 Types of Parsing:

Type | Description
Constituency Parsing | Divides a sentence into sub-phrases (constituents) like NP, VP
Dependency Parsing | Analyzes grammatical relationships (e.g., subject, object) between words

📊 Constituency Parse Tree Example (for "The cat saw a dog"):

S
├── NP
│   ├── Det: The
│   └── N: cat
└── VP
    ├── V: saw
    └── NP
        ├── Det: a
        └── N: dog

🔗 Dependency Parse Example:

 saw → root

o cat → nsubj (subject)

o dog → dobj (object)


o the, a → determiners

🧠 Parsing Algorithms:

Algorithm | Description
Top-Down Parsing | Starts from the root (S) and expands using grammar rules
Bottom-Up Parsing | Starts from the words and builds up to the sentence
CYK Algorithm | For parsing with CFG in Chomsky Normal Form
Earley Parser | Efficient for general CFGs

✅ Summary:

Concept | Description
CFG | Formal grammar with rules like S → NP VP
Grammar Rules (English) | Rule sets describing English syntax
Syntactic Parsing | Identifying sentence structure via CFG or dependency relations
Parsing Types | Constituency vs. Dependency Parsing

1. Grammar Formalisms

✅ Definition:

Grammar formalisms are structured systems of rules that describe how words combine to form grammatical sentences in a language.

These formalisms serve as the theoretical foundation for syntactic parsing.

🔹 Types of Grammar Formalisms:


Grammar Type | Description | Example
Context-Free Grammar (CFG) | Each rule has a single non-terminal on the left-hand side. | S → NP VP
Lexicalized Tree Adjoining Grammar (LTAG) | Allows extended domain of locality; includes word-level information in grammar trees. | Tree structure rooted with a verb like "eat" includes its NP object
Head-Driven Phrase Structure Grammar (HPSG) | Highly lexicalized, based on constraints and feature structures. | Uses typed feature structures to capture word agreement
Dependency Grammar | Based on binary relations between words (head-dependent). | "The cat sleeps" → sleeps is the head of cat
Combinatory Categorial Grammar (CCG) | Uses category logic and combinatory rules; highly expressive. | S/NP → a verb phrase needing a noun phrase
Minimalist Grammar (MG) | From Chomsky's minimalist program, based on operations like merge and move. | Focus on economy of derivation

💡 Why Grammar Formalisms Matter in NLP:

 Help define the structure of a language

 Allow for automatic syntactic parsing

 Form the basis for treebanks

🌳 2. Treebanks

✅ Definition:

A treebank is a parsed text corpus that annotates syntactic or semantic sentence structures using a specific grammar formalism.

Each sentence is associated with a parse tree (usually a constituency or dependency tree).
🧱 Types of Treebanks:

Type | Description | Example
Constituency Treebank | Based on phrase structure (CFG). | Penn Treebank
Dependency Treebank | Based on head-dependent relations. | Universal Dependencies (UD)
Hybrid Treebank | Combines features of both. | TIGER Treebank (German)

📌 Notable Treebanks:

Treebank | Language | Formalism | Notes
Penn Treebank | English | CFG-based | Widely used; contains POS tags and phrase structure
Universal Dependencies (UD) | Multilingual | Dependency grammar | Cross-linguistic annotations
Prague Dependency Treebank | Czech | Dependency grammar | Rich morphological tagging
TIGER Treebank | German | Hybrid | Uses both dependencies and phrases
NEGRA Treebank | German | Constituency-based | Emphasizes word order and structure

📊 Example (Constituency Parse from Penn Treebank):

(S

(NP (DT The) (NN dog))

(VP (VBD barked))

(. .))
🔗 Dependency Version:

 barked is the root

o dog → nsubj (nominal subject)

o The → det (determiner)

🔧 How Treebanks Are Used:

 Train and evaluate parsers (e.g., Stanford Parser, spaCy)

 Improve POS tagging and NER

 Statistical grammar induction

 Semantic role labeling

 Linguistic research

📈 Summary Table

Term | Definition | Example
Grammar Formalism | Rule system for sentence formation | CFG, HPSG, Dependency Grammar
Treebank | Annotated corpus of syntactic trees | Penn Treebank, Universal Dependencies
Constituency Parsing | Phrase structure analysis | NP → Det N
Dependency Parsing | Head-dependent relations | sleeps ← subj ← cat

1. Efficient Parsing for Context-Free Grammars (CFGs)

✅ Context-Free Grammar (CFG) Recap:

CFGs are sets of recursive rewriting rules used to generate patterns of strings in a language.

Example:

S → NP VP
NP → Det N

VP → V NP

🔧 Parsing Challenges:

Parsing CFGs involves finding whether a sentence can be generated by a grammar, and if yes, how (the parse tree). Some sentences can have multiple parse trees (ambiguity), which complicates parsing.

🚀 Efficient CFG Parsing Algorithms:

Algorithm | Description | Time Complexity
Top-Down Parsing | Starts from the start symbol (S), tries to rewrite using rules | Not efficient; may lead to infinite loops
Bottom-Up Parsing | Starts from the input and works upward | More efficient than top-down
Earley Parser | Works with all CFGs, top-down and bottom-up hybrid | O(n³) worst case, O(n²) average
CYK Algorithm | Requires grammar in Chomsky Normal Form | O(n³)
Chart Parsing | Uses dynamic programming to avoid redundant parsing | Efficient with memoization

📌 CYK Algorithm Overview:

 Works with Chomsky Normal Form (CNF) grammars

 Builds a parse table where each cell [i, j] stores possible non-
terminals that can derive substring from position i to j

 Time complexity: O(n³) where n is sentence length

🔹 2. Statistical Parsing

✅ What is it?
Statistical parsing uses probability to choose the most likely parse tree
when multiple parses exist.

Instead of just checking grammaticality, it asks:

“What is the most probable structure for this sentence?”

🔍 How it Works:

 Trains a parser using a treebank (annotated dataset)

 Computes probabilities from frequency counts

 Chooses parse trees based on likelihood

🔹 3. Probabilistic Context-Free Grammars (PCFGs)

✅ Definition:

A Probabilistic Context-Free Grammar (PCFG) extends CFG by assigning probabilities to each production rule.

✨ Format:

Each rule has the form:

A → B C [p]

Where p is the probability of the rule.

Example:

S → NP VP [1.0]

NP → Det N [0.6]

NP → N [0.4]

VP → V NP [0.8]

VP → V [0.2]

🧠 Parsing with PCFGs:

 Use algorithms like Viterbi parsing (a dynamic programming method) to find the most likely parse
 Combines syntax with statistics for better accuracy

 Deals well with ambiguity in natural language

📌 Use Case Example:

Given:

 "The cat saw the dog"

Two parse trees are possible. PCFG assigns probabilities to each and selects
the more likely one based on training data.
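
A minimal NLTK sketch using the rule probabilities above, with simple lexical rules added so the grammar is complete (the probabilities are illustrative, not learned from a treebank):

import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> 'the' [1.0]
N -> 'cat' [0.5] | 'dog' [0.5]
V -> 'saw' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the cat saw the dog".split()):
    print(tree)          # most probable parse
    print(tree.prob())   # its probability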

📊 Advantages of PCFG:

 Handles ambiguity effectively

 Learns from real data (treebanks)

 Helps in building robust NLP systems

✅ Summary Table

Concept | Description | Key Algorithm
CFG | Defines grammatical structure | CYK, Earley
Statistical Parsing | Selects best parse based on data | PCFG, Viterbi
PCFG | CFG + rule probabilities | Viterbi algorithm
Unit V: Semantic Analysis
1. Requirements for Knowledge Representation in AI

To build intelligent systems, information must be represented in a structured form that a machine can reason about. This is known as knowledge representation (KR).

✅ Requirements for Good Representation:

Requirement | Explanation
Representational Adequacy | Should be able to represent all kinds of knowledge required for the task (e.g., facts, rules, relationships)
Inferential Adequacy | Must support new knowledge derivation through reasoning (inference rules)
Inferential Efficiency | Should allow for efficient reasoning (fast and scalable)
Acquisitional Efficiency | Easy to acquire and modify knowledge from real-world sources
Modularity | Representation should allow modular knowledge chunks (easy to add/remove components)
Clarity | The structure should be easy to understand, define, and interpret
Expressiveness | Capable of expressing uncertainty, defaults, or exceptions where needed

🔹 2. First-Order Logic (FOL)

✅ What is FOL?

First-Order Logic (also called Predicate Logic) is a formal logic system that extends propositional logic by adding:
 Quantifiers

 Predicates

 Variables

✨ Components of FOL:

Component | Example | Meaning
Constant | John, 5 | Represents an object
Predicate | Loves(John, IceCream) | Expresses a relationship
Variable | x, y | Placeholder for objects
Quantifiers | ∀x, ∃x | "for all", "there exists"
Function | FatherOf(x) | Returns an object

🧠 Example FOL Statement:

"Everyone loves ice cream."

scss

CopyEdit

∀x Loves(x, IceCream)

"There exists a person who is a teacher."

scss

CopyEdit

∃x Teacher(x)

✅ Why FOL is useful:

 Powerful enough to represent real-world facts


 Supports inference rules (Modus Ponens, resolution, etc.)

 Basis for AI systems like expert systems and knowledge bases

🔹 3. Description Logics (DL)

✅ What are Description Logics?

Description Logics (DLs) are a family of formal knowledge representation languages. They are subsets of FOL, designed for representing structured knowledge (like ontologies).

✨ DL Focuses On:

 Concepts (Classes)

 Roles (Properties/Relationships)

 Individuals (Instances)

📌 Key Features of DL:

Feature | Description
Concepts (C) | Represent sets or classes (e.g., Person, Animal)
Roles (R) | Binary relations (e.g., hasChild, owns)
Individuals (a, b) | Objects in the domain (e.g., Alice, Car1)

🧠 Example in DL Notation:

 Father ≡ Man ⊓ ∃hasChild.Person


→ A Father is a Man who has at least one child who is a Person.

 Student ⊑ Person
→ All students are persons.
✅ Applications of DL:

 Semantic Web (used in OWL - Web Ontology Language)

 Ontology creation (e.g., biological taxonomies, legal systems)

 Medical informatics

 Natural language understanding

📌 Summary Table:

Concept | Description | Use Case
Knowledge Representation | Format to store, interpret, and reason about information | AI systems
FOL | Formal logic with predicates and quantifiers | Rule-based AI, theorem proving
Description Logics | Subset of FOL for structured knowledge | Semantic Web, Ontologies

1. Syntax-Driven Semantic Analysis

✅ What is it?

Syntax-Driven Semantic Analysis is the process of assigning meaning (semantics) to a sentence using its syntactic structure (typically derived from a parse tree).

It is based on the idea that:

"The meaning of a sentence can be built compositionally from the meaning of its parts (words/phrases) and their syntactic structure."

This concept is central in natural language processing (NLP) and compiler design.

🧠 Key Principles:

1. Compositional Semantics: Meaning is derived by combining the meanings of subparts.

2. Parse Trees & Grammar Rules: Semantic rules are attached to grammar productions.

3. Bottom-up or Top-down Traversal: Parse trees are traversed to evaluate or generate the meaning.

📌 Example:

Grammar rule:

S → NP VP

Semantic rule (attachment):

S.meaning = combine(NP.meaning, VP.meaning)

Let’s say:

 NP.meaning = "John"

 VP.meaning = "eats apples"

Then:

 S.meaning = "John eats apples"

This is how semantics are driven by syntax.

🔹 2. Semantic Attachments

✅ What are they?

Semantic attachments are semantic rules or semantic actions associated with grammar rules to generate meaning.

They define how to compute the meaning of a phrase based on its constituents.
How It Works:

Each grammar production is augmented with a semantic rule.

For example:

Production:

VP → V NP

Semantic Attachment:

VP.meaning = apply(V.meaning, NP.meaning)

If:

 V.meaning = λx.eat(x)

 NP.meaning = "apples"

Then:

 VP.meaning = eat(apples)
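
The same attachment written as a tiny Python sketch, with the meanings represented as plain strings for readability:

V = lambda x: f"eat({x})"    # V.meaning = λx.eat(x)
NP = "apples"                # NP.meaning

VP = V(NP)                   # VP.meaning = apply(V.meaning, NP.meaning)
print(VP)                    # eat(apples)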

💡 Tools Used:

 Lambda Calculus: For meaning representation

 Attribute Grammars: Formal systems with attributes and rules

 Syntax-directed translation: Common in compilers and NLP pipelines

🧠 Application in NLP:

Component | Role
Syntax Parser | Builds the syntactic structure
Semantic Analyzer | Uses attachments to build logical forms
Intermediate Representation | e.g., first-order logic, semantic graphs
Downstream Tasks | Question answering, machine translation, etc.

✅ Summary

Concept | Description
Syntax-Driven Semantic Analysis | Constructs meaning using syntactic structure
Semantic Attachments | Rules linked to grammar that specify how to compute meaning

1. Word Senses

✅ What is a Word Sense?

A word sense is a particular meaning of a word. Most words in natural language are polysemous, meaning they have multiple senses.

📌 Example:

The word "bank" can mean:

1. A financial institution (e.g., "I deposited money at the bank.")

2. The side of a river (e.g., "We sat on the river bank.")

These are two different senses of the same word.

🧠 Why is Word Sense Important?

Understanding which sense is intended is crucial for:

 Machine translation

 Information retrieval
 Question answering

 Text summarization

This is addressed in Word Sense Disambiguation (WSD).

🔹 2. Relations Between Word Senses

WordNet (a lexical database) defines several types of semantic relationships between senses:

Relation | Description | Example
Synonymy | Same meaning | big ↔ large
Antonymy | Opposite meaning | hot ↔ cold
Hypernymy | More general | animal is a hypernym of dog
Hyponymy | More specific | sparrow is a hyponym of bird
Meronymy | Part-whole | wheel is a meronym of car
Holonymy | Whole-part | car is a holonym of wheel
Troponymy | Specific manner of action | to sprint is a troponym of to run

These relations help in semantic reasoning and NLP applications.
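
A minimal NLTK WordNet sketch showing a few of these relations (assumes the 'wordnet' data has been downloaded):

from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
print(dog.definition())        # gloss of this sense of 'dog'
print(dog.hypernyms())         # more general senses (hypernymy)
print(dog.hyponyms()[:3])      # more specific senses (hyponymy)
print(wn.synsets("bank")[:2])  # a polysemous word has several synsets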

🔹 3. Thematic Roles (Theta Roles)

✅ What are Thematic Roles?

Thematic Roles describe the roles that entities play in an event or action, typically in relation to a verb.

They are part of semantic role labeling (SRL) in NLP.


📌 Common Thematic Roles:

Role | Description | Example
Agent | The doer of the action | "John" in "John kicked the ball"
Theme/Patient | The receiver of the action | "the ball" in the sentence above
Experiencer | One who feels or perceives | "Mary" in "Mary felt cold"
Instrument | Means used | "a stick" in "He hit it with a stick"
Location | Where the event happens | "in the park"
Goal | Endpoint of an action | "to the store"
Source | Starting point | "from Delhi"

🔹 4. Selectional Restrictions

✅ What are they?

Selectional restrictions are semantic constraints that verbs (or predicates) impose on their arguments. They help determine which types of words can logically fill a role.

📌 Examples:

1. eat(x) expects x to be edible.

o ✅ "She ate an apple"

o ❌ "She ate a bicycle"

2. drive(x) expects x to be a vehicle.

o ✅ "He drove a car"

o ❌ "He drove a tree"

These restrictions guide syntactic parsing, semantic analysis, and disambiguation.
✅ Summary Table

Concept | Description
Word Sense | Specific meaning of a word
Semantic Relations | Connections like synonymy, hypernymy
Thematic Roles | Roles entities play in actions
Selectional Restrictions | Semantic constraints on arguments of verbs

Word-Sense Disambiguation (WSD)

🔹 What is Word-Sense Disambiguation?

Word-Sense Disambiguation (WSD) is the task of determining which sense (meaning) of a word is activated by its context in a sentence, when the word has multiple meanings.

🔍 Example:
In the sentence "He sat on the bank of the river",
WSD helps the system understand that "bank" refers to riverbank, not a
financial institution.

🔹 Approaches to WSD

1️⃣ Supervised Learning-Based WSD

Uses labeled training data where words are tagged with the correct
sense.

🧠 How it works:

 Extract features from context (e.g., surrounding words, POS tags)

 Train a machine learning model like:

o Decision Trees
o Naive Bayes

o SVM

o Neural Networks

✅ Example:

For the word “bass”:

 "I caught a bass." → fish

 "I played the bass." → music

A classifier learns to associate context with correct senses.

✔️Pros:

 High accuracy if quality data is available

❌ Cons:

 Requires large amounts of sense-annotated corpora (expensive to create)

2️⃣ Dictionary/Gloss-Based WSD (Lesk Algorithm)

Uses definitions (glosses) from dictionaries like WordNet.

🧠 How it works (Lesk Algorithm):

 For each sense of a word:

o Compare the definition (gloss) to the context words.

o Select the sense with the most overlapping words.

✅ Example:

For "bat":

 Gloss for animal: "a nocturnal flying mammal"

 Gloss for sports: "an implement used in sports"

If context includes words like “fly”, “nocturnal” → selects animal sense.

✔️Pros:

 Doesn’t require labeled data


 Simple and interpretable

❌ Cons:

 Depends heavily on quality and coverage of dictionary

 Ignores deeper syntactic/semantic relationships

3️⃣ Thesaurus/Knowledge-Based WSD

Uses semantic networks or thesauri (like WordNet) to infer word sense using:

 Synonyms

 Hypernyms

 Semantic similarity

🧠 Example methods:

 Path-based similarity: Shortest path in a semantic graph

 Semantic relatedness scores between words

 Sense clustering

✔️Pros:

 Doesn’t need training data

 Useful in low-resource languages

❌ Cons:

 Less accurate than supervised models

🎯 Applications of WSD

Application | Role of WSD
Machine Translation | Ensures correct translation of ambiguous words
Information Retrieval | Improves search relevance by understanding query intent
Text Summarization | Identifies correct sense to avoid ambiguity
Question Answering Systems | Helps in understanding queries and extracting precise answers
Speech Recognition & Generation | Corrects homophones (e.g., "pair" vs. "pear")
Chatbots and Virtual Assistants | Enhances context understanding in conversations

✅ Summary Table

Approach | Technique | Requires Training? | Strength
Supervised | ML classifiers | ✅ Yes | High accuracy
Dictionary-Based | Lesk Algorithm | ❌ No | Easy to implement
Thesaurus-Based | WordNet paths, synonyms | ❌ No | Knowledge-rich
