NLP QB
Syntactic Ambiguity
1. Definition: Happens when a sentence can be interpreted in multiple ways because of its
structure.
2. Example: "I saw the man with the telescope." (Did I use the telescope to see the man or did
the man have the telescope?)
3. Resolution: Solved by understanding the sentence structure.
4. Focus: On how words are arranged in a sentence.
5. Source: Complex sentence structures with multiple possible meanings.
6. Tools: Sentence parsers help figure out the correct structure.
7. Effect: Changes the overall meaning based on sentence structure.
8. Example: Different possible syntax trees for the same sentence.
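The two readings correspond to two different parse trees. A minimal sketch using NLTK's chart parser with a toy grammar (the grammar rules below are illustrative assumptions, not part of these notes):

```python
# Sketch only: a toy CFG that yields two parses for the ambiguous sentence.
# The grammar rules here are illustrative assumptions, not from the notes.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pron | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Pron -> 'I'
Det -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "I saw the man with the telescope".split()

# One tree attaches the PP to the VP (I used the telescope),
# the other attaches it to the NP (the man has the telescope).
for tree in parser.parse(sentence):
    print(tree)
```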
Lexical Ambiguity
NLP encompasses various tasks, including but not limited to text analysis, speech
recognition, language translation, sentiment analysis, and more. It deals with both the
syntactic (structure) and semantic (meaning) aspects of language, allowing computers
to understand not just the words, but also the context in which they are used.
1. Phonology
2. Morphology
3. Syntax
• Definition: Syntax is the study of the rules and principles that govern the structure of
sentences in a language. It involves the arrangement of words and phrases to create
well-formed sentences.
• Key Concepts:
o Grammar: The set of rules that dictate how words can be combined to form
sentences.
o Parsing: Analyzing the syntactic structure of a sentence to understand its
grammatical components and relationships.
o Sentence Structure: Understanding how different parts of a sentence (such as
subject, verb, and object) are organized.
• Application in NLP: Syntax is crucial for parsing sentences, machine translation, and
generating grammatically correct sentences in NLP applications.
4. Semantics
5. Reasoning
• Definition: Reasoning in the context of NLP refers to the ability of a system to draw
inferences or conclusions from information, including understanding implications,
making predictions, and solving problems based on language inputs.
• Key Concepts:
o Logical Reasoning: Using formal logic to derive conclusions from premises
(e.g., if A implies B, and A is true, then B must be true).
o Common-Sense Reasoning: Using general knowledge about the world to
make inferences that are not explicitly stated in the text.
o Inference: Drawing conclusions from text based on implied meanings or
context.
• Application in NLP: Reasoning is used in advanced applications like natural
language understanding, intelligent virtual assistants, and systems that require
decision-making based on language input, such as automated reasoning and AI-driven
dialogue systems.
Syntactic Analysis
Syntax concerns the proper ordering of words and its effect on meaning
This involves analyzing the words in a sentence to determine the grammatical structure of the
sentence
The words are transformed into a structure that shows how they are related to each other
E.g. “the girl the go to the school” would be rejected by the English syntactic analyzer
E.g. “Ravi apple eats”
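A minimal sketch of how a syntactic analyzer accepts a well-ordered sentence and rejects the wrong word order, using a toy NLTK grammar (the grammar is an illustrative assumption, not from the notes):

```python
# Sketch only: a toy grammar accepts "Ravi eats apple" but has no parse
# for "Ravi apple eats" (wrong word order). Rules are illustrative.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Ravi' | 'apple'
V -> 'eats'
""")
parser = nltk.ChartParser(grammar)

for sent in ["Ravi eats apple", "Ravi apple eats"]:
    trees = list(parser.parse(sent.split()))
    print(sent, "->", "accepted" if trees else "rejected")
```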
Semantic Analysis
Semantics concerns the (literal) meaning of words, phrases, and sentences
This abstracts the dictionary meaning or the exact meaning from context
The structures which are created by the syntactic analyzer are assigned meaning
E.g. “colorless blue idea” would be rejected by the semantic analyzer, as “colorless” and
“blue” do not make sense together
E.g. “Stone eats apple”
Discourse Integration
Sense of the context
The meaning of any single sentence depends upon the sentences that precede it and may also
influence the meaning of the sentences that follow it
E.g. the word “it” in the sentence “she wanted it” depends upon the prior discourse context
Pragmatic Analysis
Pragmatics concerns the overall communicative and social context and its effect on
interpretation
It means abstracting or deriving the purposeful use of the language in situations
Importantly, it covers those aspects of language which require world knowledge
The main focus is on reinterpreting what was said in terms of what it actually means
E.g. “Close the window?” should be interpreted as a request rather than an order
1. Ambiguity
• Lexical Ambiguity: A single word can have multiple meanings (e.g., "bat" could
refer to a flying mammal or a piece of sports equipment).
• Syntactic Ambiguity: A sentence can have multiple possible interpretations based on
its structure (e.g., "Visiting relatives can be tiring" could mean that the act of visiting
is tiring or that the relatives being visited are tiring).
• Semantic Ambiguity: The meaning of a sentence can be unclear (e.g., "He saw her
duck" could mean he saw her pet duck, or he saw her lower her head).
2. Contextual Understanding
• Challenge: Language often requires understanding context, which can involve prior
sentences, cultural knowledge, or situational awareness. Machines struggle to grasp
context as effectively as humans, leading to misunderstandings.
3. Sarcasm and Irony
• Challenge: Detecting sarcasm and irony is difficult because the literal meaning of the
words is often the opposite of the intended meaning. This requires nuanced
understanding beyond simple text analysis.
4. Figurative Language
• Challenge: Idioms, metaphors, and other figurative language are not meant to be
interpreted literally (e.g., "kick the bucket" means "to die"). NLP systems often
struggle with these expressions as they rely on understanding beyond the literal
words.
5. Domain-Specific Knowledge
• Challenge: Language varies greatly between different domains (e.g., medical, legal,
technical jargon). An NLP system trained in one domain might not perform well in
another due to lack of specialized knowledge.
6. Low-Resource Languages
• Challenge: Many languages and dialects lack large datasets needed for training NLP
models. This creates a disparity in the effectiveness of NLP systems across different
languages.
7. Language Evolution
• Challenge: Language constantly evolves, with new words, phrases, and slang
emerging regularly. Keeping NLP systems up-to-date with these changes is
challenging.
8. Handling Negation
• Challenge: Understanding and processing negation (e.g., "I don't like pizza" vs. "I
like pizza") can be complex, as it often requires recognizing subtle shifts in meaning.
9. Sentiment Analysis
• Challenge: Determining the sentiment behind text is difficult when opinions are mixed,
implicit, or expressed through sarcasm and negation.
10. Data Privacy and Ethics
• Challenge: Collecting and processing language data raises concerns about user
privacy, data security, and ethical use. Ensuring that NLP models do not reinforce
biases or misuse sensitive information is a significant challenge.
11. Computational Resources
• Challenge: Training advanced NLP models, especially with deep learning techniques,
requires significant computational resources, large datasets, and energy, which can be
costly and environmentally taxing.
1. Phonological Level
• Knowledge: Understanding the sounds of language, such as how words are pronounced
and the rhythm of speech.
• Use: Helps in speech recognition and synthesis.
2. Morphological Level
• Knowledge: Understanding how words are built from smaller parts, like prefixes and
roots.
• Use: Helps in breaking down words into their base forms and variations (e.g.,
"running" → "run").
3. Syntactic Level
4. Semantic Level
• Knowledge: Understanding the meaning of words and sentences and how they
combine to convey a message.
• Use: Helps in interpreting the actual meaning of text and resolving ambiguities.
5. Pragmatic Level
6. Discourse Level
• Knowledge: Understanding how sentences and parts of a text relate to each other.
• Use: Helps in making sense of the overall structure and flow of a conversation or text.
Indian Language Processing in NLP involves creating technology to work with the many
languages spoken in India. Here’s a simplified overview:
1. Diverse Languages
• Languages: India has 22 major languages and many more dialects, each with
different scripts and grammar.
• Challenges: This diversity makes it tricky to develop universal NLP tools.
2. Resources and Tools
• Data: There aren't always large, well-annotated datasets for all Indian languages,
which makes training models harder.
• Tools: Tools like translators and text processors are being developed for major
languages like Hindi, Tamil, and Bengali.
3. Applications
• Translation: Translating text between Indian languages and English.
• Speech Recognition: Converting spoken Indian languages into text.
• Sentiment Analysis: Analyzing opinions expressed in Indian languages on social
media and other platforms.
4. Challenges
• Data Scarcity: Many languages lack enough digital resources.
• Script Differences: Various scripts and spelling differences add complexity.
• Dialects: Different regional dialects make it hard to create one-size-fits-all solutions.
5. Progress
• Research: Growing efforts to improve tools and technology for Indian languages.
• Initiatives: Government and organizations are working to develop better language
resources.
1. Tokenization
• Description: The process of splitting text into smaller units called tokens. Tokens are
usually words or phrases.
• Purpose: Converts a text string into manageable pieces for further analysis, such as
words or sentences.
• Example:
o Text: "The quick brown fox."
o Tokens: ["The", "quick", "brown", "fox"]
2. Stemming
• Description: Reduces words to their base or root form by removing suffixes or
prefixes.
• Purpose: Simplifies words to their root form to standardize them and group similar
words.
• Approach: Uses heuristics and algorithms to strip suffixes (e.g., "running" → "run").
• Example:
o Word: "running"
o Stemmed Word: "run"
3. Lemmatization
• Description: Reduces words to their base or dictionary form (lemma) considering the
context and part of speech.
• Purpose: Provides a more accurate base form of a word, taking into account its
meaning and grammatical role.
• Approach: Uses dictionaries and linguistic rules to return the proper base form (e.g.,
"running" → "run").
• Example:
o Word: "running"
o Lemma: "run" (contextually correct base form)
Or
Method:
• Stemming: Uses heuristic rules to remove prefixes and suffixes.
• Lemmatization: Uses dictionaries and linguistic rules to find the correct base form.
Accuracy:
• Stemming: Less accurate; may produce non-dictionary words.
• Lemmatization: More accurate; always produces valid dictionary words.
Context:
• Stemming: Ignores context and grammatical role.
• Lemmatization: Considers context and part of speech for accurate base forms.
Computational Complexity:
• Stemming: Faster and less resource-intensive.
• Lemmatization: More computationally intensive and resource-demanding.
Output:
• Stemming: May result in non-words or incorrect roots.
• Lemmatization: Produces grammatically appropriate and meaningful words.
Use Cases:
• Stemming: Suitable for tasks where speed is prioritized, such as search engines.
• Lemmatization: Preferred for applications requiring precise meaning, like text
analysis.
Stemming algorithms work by cutting off the end or the beginning of the word, taking into
account a list of common prefixes and suffixes that can be found in an inflected word. This
indiscriminate cutting can be successful on some occasions, but not always, which is why this
approach presents some limitations.
Lemmatization, on the other hand, takes into consideration the morphological analysis of the
words. To do so, it is necessary to have detailed dictionaries which the algorithm can look
through to link the form back to its lemma.
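A minimal sketch contrasting the two approaches with NLTK's PorterStemmer and WordNetLemmatizer (assumes the WordNet data has already been downloaded; the word list is illustrative):

```python
# Sketch only: stemming vs. lemmatization side by side with NLTK.
# Requires: nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "studies", "flies"]:
    stem = stemmer.stem(word)                    # heuristic suffix stripping
    lemma = lemmatizer.lemmatize(word, pos="v")  # dictionary lookup, verb POS
    print(word, "| stem:", stem, "| lemma:", lemma)
```

Note how the stemmer may return forms that are not dictionary words, while the lemmatizer needs a part-of-speech hint to return the correct base form.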
9. What is morphology parsing? & Write short note on Survey of English Morphology
Morphology Parsing
Morphology Parsing is the process of analyzing the structure of words by breaking them
down into their smallest meaningful units, known as morphemes. Morphemes include roots,
prefixes, suffixes, and inflections. This analysis helps in understanding how words are
formed and how their parts contribute to their meaning and grammatical function.
Morphology parsing is crucial in various Natural Language Processing (NLP) tasks, such as
stemming, lemmatization, part-of-speech tagging, spell checking, and machine translation.
Survey of English Morphology in NLP explores the methods and challenges associated
with analyzing English word structures. Key aspects include:
1. Inflection:
o Description: Modifications to words to express grammatical features like tense,
number, or case.
o Example: "play" (base form) → "played" (past tense), "cat" (singular) → "cats"
(plural).
2. Derivation:
o Description: Creating new words by adding prefixes or suffixes to base words.
o Example: "happy" (base form) → "unhappy" (with prefix) or "happiness" (with
suffix).
3. Compounding:
o Description: Combining two or more words to form a new word with a specific
meaning.
o Example: "tooth" + "brush" → "toothbrush".
4. Morphological Analysis Tools:
o Tokenizers: Break text into words and morphemes.
o Stemmers: Reduce words to their base form (e.g., "running" → "run").
o Lemmatizers: Convert words to their dictionary form based on context (e.g.,
"running" → "run").
5. Challenges:
o Irregular Forms: English has many exceptions and irregular forms that do not follow
standard patterns (e.g., "go" → "went").
o Ambiguity: Words can have multiple meanings or forms depending on context (e.g.,
"bark" as a tree's outer layer vs. a dog's sound).
6. Applications:
o Search Engines: Improve search accuracy by understanding word variations.
o Text Analysis: Enhance understanding and processing of text in applications like
sentiment analysis and machine translation.
Regular Expressions (Regex) are patterns used to match sequences of characters in text.
They provide a powerful way to search, match, and manipulate text based on specified
patterns. In NLP, regex is used for tasks such as text preprocessing, tokenization, and data
extraction.
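A minimal sketch of regex-based matching and extraction with Python's re module (the patterns and sample text are deliberately simplified illustrations):

```python
# Sketch only: common regex tasks in NLP preprocessing. Patterns and sample
# text are illustrative and deliberately simplified.
import re

text = "Contact us at support@example.com before 25-12-2024 or call 9876543210."

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)  # crude e-mail pattern
dates = re.findall(r"\d{2}-\d{2}-\d{4}", text)         # dd-mm-yyyy dates
phones = re.findall(r"\b\d{10}\b", text)               # 10-digit numbers
words = re.findall(r"[A-Za-z]+", text)                 # simple word tokens

print(emails, dates, phones)
print(words[:4])
```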
Basic Components
11. Write short note on a) Finite automata b) N-gram model c) Finite transducer
a) Finite Automata
Finite Automata are mathematical models used to represent and work with regular
languages. They are used in NLP to recognize patterns and process text based on specific
rules.
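A minimal sketch of a deterministic finite automaton implemented as a transition table (the states, alphabet, and accepted pattern are illustrative choices):

```python
# Sketch only: a DFA that accepts strings matching the pattern a b* c
# (e.g. "ac", "abc", "abbc"). States and transitions are illustrative.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",
    ("q1", "c"): "q2",
}
START, ACCEPT = "q0", {"q2"}

def accepts(string: str) -> bool:
    state = START
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPT

for s in ["ac", "abbc", "abcb", "bc"]:
    print(s, accepts(s))
```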
b) N-gram Model
N-gram Model is a statistical language model used to predict the next word in a sequence
based on the preceding N-1 words.
c) Finite Transducer
Finite Transducer is a computational model used to transform input sequences into output
sequences. It extends the concept of finite automata by incorporating output symbols.
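A minimal sketch of a finite-state transducer (Mealy-style): each input symbol is mapped to an output symbol that depends on the current state (the capitalization task here is just an illustrative choice):

```python
# Sketch only: a finite-state transducer reads an input string one symbol
# at a time and emits output symbols that depend on the current state.
# Here it capitalises the first letter of every word. States are illustrative.
def transduce(text: str) -> str:
    state = "WORD_START"
    output = []
    for ch in text:
        if ch == " ":
            output.append(ch)            # emit the space unchanged
            state = "WORD_START"         # next letter starts a new word
        elif state == "WORD_START":
            output.append(ch.upper())    # emit the upper-case variant
            state = "IN_WORD"
        else:                            # state == "IN_WORD"
            output.append(ch)            # emit the symbol unchanged
    return "".join(output)

print(transduce("natural language processing"))  # Natural Language Processing
```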
12. How is the N-gram model used for spelling correction? & also explain the variations of the
N-gram model
• Spelling Correction:
• Phonetic Errors: Errors where the misspelled word sounds similar to the correct
word.
• Non-Word Errors: Errors that result in sequences not found in the lexicon or
orthographic form.
• Real-Word Errors: Errors that produce valid words but are incorrect due to
typographical mistakes or other errors.
• Non-Word Errors:
o Result in invalid strings of characters.
o Example: Trigrams like 'qst' or bigrams like 'qd' are not valid English
combinations (see the sketch below).
• Real-Word Errors:
o Result in actual words but incorrect in context.
o Often caused by typos or common spelling mistakes.
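A minimal sketch of how character n-grams can flag non-word errors: bigrams that never occur in a (tiny, illustrative) word list are treated as evidence of a misspelling:

```python
# Sketch only: character bigrams from a tiny illustrative word list are used
# to flag likely non-word errors -- a candidate containing a bigram never
# seen in the corpus (such as 'qd') is treated as suspicious.
CORPUS = ["question", "quest", "quick", "quote", "speed", "deed"]

def char_bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

KNOWN_BIGRAMS = set().union(*(char_bigrams(w) for w in CORPUS))

def looks_like_non_word(word):
    unseen = char_bigrams(word) - KNOWN_BIGRAMS
    return bool(unseen), unseen

for candidate in ["quick", "qdick", "qstion"]:
    flagged, unseen = looks_like_non_word(candidate)
    print(candidate, "-> suspicious:", flagged, unseen)
```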
1. Unigram Model:
o Description: Considers single words independently, without regard to their context.
o Application: Useful for simple frequency counts and basic predictions.
o Limitation: Lacks contextual information, which limits its predictive power.
2. Bigram Model:
o Description: Considers pairs of consecutive words.
o Application: Captures some context and dependencies between words, making it
more effective than a unigram model.
o Example: Predicts the next word based on the preceding word.
o Limitation: Still limited in capturing longer context dependencies.
3. Trigram Model:
o Description: Considers triples of consecutive words.
o Application: Provides a better understanding of context by incorporating more
previous words.
o Example: Predicts the next word based on the previous two words.
o Limitation: Requires more data and memory, and may not handle very long contexts
well.
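A minimal sketch of unigram, bigram, and trigram counts over a toy corpus, with the bigram counts used to predict the next word (the corpus and queries are illustrative):

```python
# Sketch only: build unigram, bigram and trigram counts from a toy corpus
# and use the bigram counts to predict the next word.
from collections import Counter

corpus = "i like nlp i like deep learning i enjoy nlp".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def predict_next(word):
    # pick the most frequent word that followed `word` in the corpus
    candidates = {w2: c for (w1, w2), c in bigrams.items() if w1 == word}
    return max(candidates, key=candidates.get) if candidates else None

print(unigrams.most_common(3))
print(bigrams.most_common(3))
print(predict_next("i"))  # 'like' (it followed 'i' most often)
```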
13. Explain N-gram Sensitivity to the Training Corpus
The performance of an N-gram Model depends a lot on the quality and size of the training
data. Here’s a simplified explanation:
1. Coverage:
o Large Corpus: Includes many word sequences, making the model better at
predicting and understanding language.
o Small Corpus: Might miss many sequences, leading to less accurate
predictions.
2. Handling Unseen Sequences:
o Small Corpus: Might not include all possible sequences, causing the model to
struggle with new or rare combinations.
o Solution: Use smoothing techniques to estimate probabilities for unseen
sequences (see the add-one smoothing sketch after this list).
3. Prediction Accuracy:
o Well-Represented Corpus: Provides better predictions by covering a wide
range of contexts and word sequences.
o Underrepresented Corpus: Results in less accurate predictions due to
missing context.
4. Model Complexity:
o Large Corpus: Supports more complex models (like 4-grams or 5-grams) that
understand longer sequences.
o Small Corpus: Limits the model to simpler forms (like unigrams or bigrams),
capturing less context.
5. Domain Adaptation:
o Domain-Specific Corpus: Helps the model understand specific terms and
contexts (e.g., medical or legal terms).
o General Corpus: May not be as effective for specialized fields.
6. Frequency and Distribution:
o Balanced Corpus: Includes a mix of common and rare sequences, making the
model more robust.
o Biased Corpus: Skewed towards certain topics, which might limit the model’s
performance in other areas.
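A minimal sketch of Laplace (add-one) smoothing, referenced in point 2 above, so that bigrams unseen in the training corpus still receive a small non-zero probability (the corpus and queries are illustrative):

```python
# Sketch only: Laplace (add-one) smoothing for bigram probabilities.
# P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V), where V is vocabulary size.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
vocab = set(corpus)
V = len(vocab)

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(bigram_prob("the", "cat"))  # seen bigram: relatively high probability
print(bigram_prob("cat", "mat"))  # unseen bigram: small but non-zero
```

Without smoothing, the unseen bigram would receive probability zero, which is exactly the problem a small or unrepresentative training corpus causes.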