NLP QB


1. Differentiate between Syntactic ambiguity and lexical ambiguity.

Syntactic Ambiguity

1. Definition: Happens when a sentence can be interpreted in multiple ways because of its
structure.
2. Example: "I saw the man with the telescope." (Did I use the telescope to see the man or did
the man have the telescope?)
3. Resolution: Solved by understanding the sentence structure.
4. Focus: On how words are arranged in a sentence.
5. Source: Complex sentence structures with multiple possible meanings.
6. Tools: Sentence parsers help figure out the correct structure.
7. Effect: Changes the overall meaning based on sentence structure.
8. Representation: The same sentence may correspond to different possible syntax trees.

Lexical Ambiguity

1. Definition: Happens when a word has more than one meaning.
2. Example: "Bank" (Could mean a place to store money or the side of a river.)
3. Resolution: Solved by using the surrounding words to figure out which meaning is correct.
4. Focus: On the different meanings of individual words.
5. Source: Words with multiple definitions.
6. Tools: Contextual clues and word sense disambiguation help identify the right meaning.
7. Effect: Changes how we understand individual words in context.
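
The word sense disambiguation mentioned in point 6 can be sketched with NLTK's classic Lesk implementation. A minimal example, assuming nltk and its WordNet data are installed (Lesk is a crude baseline, so the chosen sense depends on word overlap with the WordNet glosses):

```python
# Hedged sketch: disambiguating "bank" with NLTK's Lesk algorithm.
# Assumes: pip install nltk, plus nltk.download('wordnet').
from nltk.wsd import lesk

context = "I deposited the cheque at the bank yesterday".split()
sense = lesk(context, "bank")  # picks the WordNet synset whose gloss best overlaps the context
print(sense, "->", sense.definition())
```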

2. What is NLP? What are the applications of NLP?

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that
focuses on the interaction between computers and humans using natural language. The
goal of NLP is to enable machines to understand, interpret, and generate human
language in a way that is both meaningful and useful. NLP combines computational
linguistics, machine learning, and deep learning techniques to process and analyze
large amounts of natural language data.

NLP encompasses various tasks, including but not limited to text analysis, speech
recognition, language translation, sentiment analysis, and more. It deals with both the
syntactic (structure) and semantic (meaning) aspects of language, allowing computers
to understand not just the words, but also the context in which they are used.

Here are brief explanations of some common applications of NLP:

• Machine Translation: Automatically translates text from one language to another.
• Database Access: Enables natural language queries to retrieve data from
databases.
• Information Retrieval: Selects and retrieves relevant documents based on a
user's query.
• Text Categorization: Sorts and classifies text into predefined topic categories.
• Extracting Data from Text: Converts unstructured text into structured data for
analysis.
• Spoken Language Control Systems: Allows voice commands to control
devices and systems.
• Spelling and Grammar Checkers: Detects and corrects spelling and
grammatical errors in text.

3. What are the levels of NLP?

1. Phonology

• Definition: Phonology is the study of the sound systems of languages. It focuses on
how sounds function within a particular language or languages and how they are
organized and used to convey meaning.
• Key Concepts:
o Phonemes: The smallest units of sound that can distinguish one word from
another in a language (e.g., the difference between the sounds /p/ and /b/ in
"pat" and "bat").
o Prosody: The rhythm, stress, and intonation of speech, which contribute to
meaning and emotion.
• Application in NLP: Phonology is important in speech recognition and text-to-
speech systems, where understanding and generating accurate sounds is crucial.

2. Morphology

• Definition: Morphology is the study of the structure and formation of words. It
examines how words are formed from smaller units called morphemes, which are the
smallest grammatical units in a language.
• Key Concepts:
o Morphemes: The smallest units of meaning in a language, such as roots,
prefixes, and suffixes (e.g., "un-", "happy", and "-ness" in "unhappiness").
o Inflectional Morphology: Changes a word’s form to express grammatical
features like tense, case, or number (e.g., "run" vs. "running").
o Derivational Morphology: Forms new words by adding prefixes or suffixes,
often changing the word’s grammatical category (e.g., "happy" →
"happiness").
• Application in NLP: Morphological analysis is essential for tasks like stemming,
lemmatization, and understanding word formation.

3. Syntax

• Definition: Syntax is the study of the rules and principles that govern the structure of
sentences in a language. It involves the arrangement of words and phrases to create
well-formed sentences.
• Key Concepts:
o Grammar: The set of rules that dictate how words can be combined to form
sentences.
o Parsing: Analyzing the syntactic structure of a sentence to understand its
grammatical components and relationships.
o Sentence Structure: Understanding how different parts of a sentence (such as
subject, verb, and object) are organized.
• Application in NLP: Syntax is crucial for parsing sentences, machine translation, and
generating grammatically correct sentences in NLP applications.

4. Semantics

• Definition: Semantics is the study of meaning in language. It focuses on how words,
phrases, and sentences convey meaning, including the relationships between different
meanings.
• Key Concepts:
o Word Sense Disambiguation: Determining the correct meaning of a word
based on context.
o Semantic Roles: Understanding the roles that entities play in actions
described by sentences (e.g., who is the agent, patient, etc.).
o Compositional Semantics: How the meaning of a whole sentence is derived
from the meanings of its parts.
• Application in NLP: Semantics is fundamental for tasks like information retrieval,
question answering, and sentiment analysis, where understanding meaning is key.

5. Reasoning

• Definition: Reasoning in the context of NLP refers to the ability of a system to draw
inferences or conclusions from information, including understanding implications,
making predictions, and solving problems based on language inputs.
• Key Concepts:
o Logical Reasoning: Using formal logic to derive conclusions from premises
(e.g., if A implies B, and A is true, then B must be true).
o Common-Sense Reasoning: Using general knowledge about the world to
make inferences that are not explicitly stated in the text.
o Inference: Drawing conclusions from text based on implied meanings or
context.
• Application in NLP: Reasoning is used in advanced applications like natural
language understanding, intelligent virtual assistants, and systems that require
decision-making based on language input, such as automated reasoning and AI-driven
dialogue systems.

4. Explain the stages of NLP?


1. Morphological and Lexical Analysis
The lexicon of a language is its vocabulary, including its words and expressions.
Morphology covers the analysis, identification, and description of the structure of words.
Lexical analysis involves dividing a text into paragraphs, sentences, and words.

2. Syntactic Analysis
Syntax concerns the proper ordering of words and its effect on meaning.
This involves analyzing the words in a sentence to depict the grammatical structure of the
sentence.
The words are transformed into a structure that shows how they are related to each other.
E.g., "the girl the go to the school" would definitely be rejected by an English syntactic
analyzer.
E.g., "Ravi apple eats" would likewise be rejected.
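
A minimal sketch of this kind of syntactic check, using a toy context-free grammar in NLTK (the grammar and sentences are illustrative assumptions, not a real English grammar):

```python
# Toy syntactic analysis with NLTK: the hand-written CFG parses
# "Ravi eats an apple" but yields no parse for "Ravi apple eats".
import nltk

grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> 'Ravi' | Det N
  VP -> V NP
  Det -> 'an'
  N  -> 'apple'
  V  -> 'eats'
""")
parser = nltk.ChartParser(grammar)

for tree in parser.parse("Ravi eats an apple".split()):
    tree.pretty_print()                                   # shows the grammatical structure
print(list(parser.parse("Ravi apple eats".split())))      # [] -- no valid parse, rejected
```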

3. Semantic Analysis
Semantics concerns the (literal) meaning of words, phrases, and sentences.
This abstracts the dictionary meaning, or the exact meaning, from context.
The structures created by the syntactic analyzer are assigned meanings.
E.g., "colorless blue idea" would be rejected by the analyzer, as "colorless" and "blue" do not
make sense together.
E.g., "Stone eats apple"

4. Discourse Integration
This captures the sense of the context.
The meaning of any single sentence depends upon the sentences that precede it and may also
influence the meaning of the sentences that follow it.
E.g., the word "it" in the sentence "she wanted it" depends upon the prior discourse context.

5. Pragmatic Analysis
Pragmatics concerns the overall communicative and social context and its effect on
interpretation.
It means abstracting or deriving the purposeful use of language in situations, importantly
those aspects of language which require world knowledge.
The main focus is on reinterpreting what was said in terms of what was actually meant.
E.g., "close the window?" should be interpreted as a request rather than an order.

5. Explain the challenges in NLP?

1. Ambiguity

• Lexical Ambiguity: A single word can have multiple meanings (e.g., "bat" could
refer to a flying mammal or a piece of sports equipment).
• Syntactic Ambiguity: A sentence can have multiple possible interpretations based on
its structure (e.g., "Visiting relatives can be tiring" could mean that the act of visiting
is tiring or that the relatives being visited are tiring).
• Semantic Ambiguity: The meaning of a sentence can be unclear (e.g., "He saw her
duck" could mean he saw her pet duck, or he saw her lower her head).

2. Contextual Understanding

• Challenge: Language often requires understanding context, which can involve prior
sentences, cultural knowledge, or situational awareness. Machines struggle to grasp
context as effectively as humans, leading to misunderstandings.

3. Sarcasm and Irony

• Challenge: Detecting sarcasm and irony is difficult because the literal meaning of the
words is often the opposite of the intended meaning. This requires nuanced
understanding beyond simple text analysis.

4. Idioms and Figurative Language

• Challenge: Idioms, metaphors, and other figurative language are not meant to be
interpreted literally (e.g., "kick the bucket" means "to die"). NLP systems often
struggle with these expressions as they rely on understanding beyond the literal
words.

5. Domain-Specific Knowledge

• Challenge: Language varies greatly between different domains (e.g., medical, legal,
technical jargon). An NLP system trained in one domain might not perform well in
another due to lack of specialized knowledge.

6. Low-Resource Languages

• Challenge: Many languages and dialects lack large datasets needed for training NLP
models. This creates a disparity in the effectiveness of NLP systems across different
languages.

7. Language Evolution

• Challenge: Language constantly evolves, with new words, phrases, and slang
emerging regularly. Keeping NLP systems up-to-date with these changes is
challenging.

8. Handling Negation

• Challenge: Understanding and processing negation (e.g., "I don't like pizza" vs. "I
like pizza") can be complex, as it often requires recognizing subtle shifts in meaning.

9. Sentiment Analysis

• Challenge: Accurately determining sentiment in text is difficult, especially in cases of
mixed emotions, sarcasm, or context-dependent opinions.

10. Data Privacy and Ethics

• Challenge: Collecting and processing language data raises concerns about user
privacy, data security, and ethical use. Ensuring that NLP models do not reinforce
biases or misuse sensitive information is a significant challenge.

11. Multilinguality and Translation

• Challenge: Building systems that work seamlessly across multiple languages or
accurately translate between them is difficult due to variations in grammar, syntax,
and cultural nuances.

12. Resource and Computation Intensity

• Challenge: Training advanced NLP models, especially with deep learning techniques,
requires significant computational resources, large datasets, and energy, which can be
costly and environmentally taxing.

6. Explain the knowledge levels in NLP?

1. Phonological Level

• Knowledge: Understanding the sounds of language, like how letters are pronounced
and the rhythm of speech.
• Use: Helps in speech recognition and synthesis.

2. Morphological Level

• Knowledge: Understanding how words are built from smaller parts, like prefixes and
roots.
• Use: Helps in breaking down words into their base forms and variations (e.g.,
"running" → "run").

3. Syntactic Level

• Knowledge: Knowing how words fit together in sentences according to grammar
rules.
• Use: Helps in figuring out sentence structure and ensuring grammatical correctness.

4. Semantic Level

• Knowledge: Understanding the meaning of words and sentences and how they
combine to convey a message.
• Use: Helps in interpreting the actual meaning of text and resolving ambiguities.

5. Pragmatic Level

• Knowledge: Interpreting language based on context and the speaker's intentions.
• Use: Helps in understanding implied meanings, such as sarcasm or indirect requests.

6. Discourse Level

• Knowledge: Understanding how sentences and parts of a text relate to each other.
• Use: Helps in making sense of the overall structure and flow of a conversation or text.

7. World Knowledge Level

• Knowledge: Using general knowledge about the world to understand references and
implied information in text.
• Use: Helps in answering questions, making inferences, and understanding context
beyond the text itself.

7. Write a short note on Indian language processing?

Indian Language Processing in NLP involves creating technology to work with the many
languages spoken in India. Here’s a simplified overview:
1. Diverse Languages
• Languages: India has 22 major languages and many more dialects, each with
different scripts and grammar.
• Challenges: This diversity makes it tricky to develop universal NLP tools.
2. Resources and Tools
• Data: There aren't always large, well-annotated datasets for all Indian languages,
which makes training models harder.
• Tools: Tools like translators and text processors are being developed for major
languages like Hindi, Tamil, and Bengali.
3. Applications
• Translation: Translating text between Indian languages and English.
• Speech Recognition: Converting spoken Indian languages into text.
• Sentiment Analysis: Analyzing opinions expressed in Indian languages on social
media and other platforms.
4. Challenges
• Data Scarcity: Many languages lack enough digital resources.
• Script Differences: Various scripts and spelling differences add complexity.
• Dialects: Different regional dialects make it hard to create one-size-fits-all solutions.
5. Progress
• Research: Growing efforts to improve tools and technology for Indian languages.
• Initiatives: Government and organizations are working to develop better language
resources.

8. Explain Tokenization, Stemming, and Lemmatization? OR Write the difference between
Stemming and Lemmatization

1. Tokenization
• Description: The process of splitting text into smaller units called tokens. Tokens are
usually words or phrases.
• Purpose: Converts a text string into manageable pieces for further analysis, such as
words or sentences.
• Example:
o Text: "The quick brown fox."
o Tokens: ["The", "quick", "brown", "fox"]
2. Stemming
• Description: Reduces words to their base or root form by removing suffixes or
prefixes.
• Purpose: Simplifies words to their root form to standardize them and group similar
words.
• Approach: Uses heuristics and algorithms to strip suffixes (e.g., "running" → "run").
• Example:
o Word: "running"
o Stemmed Word: "run"
3. Lemmatization
• Description: Reduces words to their base or dictionary form (lemma) considering the
context and part of speech.
• Purpose: Provides a more accurate base form of a word, taking into account its
meaning and grammatical role.
• Approach: Uses dictionaries and linguistic rules to return the proper base form (e.g.,
"running" → "run").
• Example:
o Word: "running"
o Lemma: "run" (contextually correct base form)

Or

Method:
• Stemming: Uses heuristic rules to remove prefixes and suffixes.
• Lemmatization: Uses dictionaries and linguistic rules to find the correct base form.
Accuracy:
• Stemming: Less accurate; may produce non-dictionary words.
• Lemmatization: More accurate; always produces valid dictionary words.
Context:
• Stemming: Ignores context and grammatical role.
• Lemmatization: Considers context and part of speech for accurate base forms.
Computational Complexity:
• Stemming: Faster and less resource-intensive.
• Lemmatization: More computationally intensive and resource-demanding.
Output:
• Stemming: May result in non-words or incorrect roots.
• Lemmatization: Produces grammatically appropriate and meaningful words.
Use Cases:
• Stemming: Suitable for tasks where speed is prioritized, such as search engines.
• Lemmatization: Preferred for applications requiring precise meaning, like text
analysis.

Stemming algorithms work by cutting off the end or the beginning of the word, taking into
account a list of common prefixes and suffixes that can be found in an inflected word. This
indiscriminate cutting can be successful on some occasions, but not always, which is why this
approach has some limitations.
Lemmatization, on the other hand, takes into consideration the morphological analysis of the
words. To do so, it is necessary to have detailed dictionaries which the algorithm can look
through to link the form back to its lemma.
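
A minimal sketch showing all three side by side with NLTK (assumes nltk is installed along with its punkt tokenizer and WordNet data):

```python
# Tokenization, stemming, and lemmatization compared with NLTK.
# Assumes: nltk.download('punkt'); nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = word_tokenize("The leaves are falling and the children are running.")
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

for tok in tokens:
    # stemming chops suffixes heuristically ("leaves" -> "leav");
    # lemmatization (here as verbs) consults WordNet for a valid form ("leaves" -> "leave")
    print(f"{tok:10} stem={stemmer.stem(tok):10} lemma={lemmatizer.lemmatize(tok, pos='v')}")
```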

9. What is morphology parsing? Also write a short note on the survey of English morphology

Morphology Parsing

Morphology Parsing is the process of analyzing the structure of words by breaking them
down into their smallest meaningful units, known as morphemes. Morphemes include roots,
prefixes, suffixes, and inflections. This analysis helps in understanding how words are
formed and how their parts contribute to their meaning and grammatical function.
Morphology parsing is crucial in various Natural Language Processing (NLP) tasks, such as:

• Part-of-Speech Tagging: Identifying the grammatical role of each word in a sentence.
• Named Entity Recognition: Detecting and classifying entities like names and locations.
• Machine Translation: Translating text by correctly understanding word structures in both
source and target languages.
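
A toy sketch of what a morphology parser does, using a small hand-written suffix list (purely illustrative; real parsers also undo spelling changes such as happy → happi-):

```python
# Illustrative morphology-parsing sketch: split a word into stem + suffix
# by trying a few common English inflectional/derivational suffixes.
def parse_morphemes(word, suffixes=("ness", "ing", "ed", "es", "s")):
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)], suf     # (stem, suffix)
    return word, None                        # no known suffix found

print(parse_morphemes("playing"))    # ('play', 'ing')
print(parse_morphemes("cats"))       # ('cat', 's')
print(parse_morphemes("happiness"))  # ('happi', 'ness') -- the y->i change is not undone
```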

Survey of English Morphology in NLP

Survey of English Morphology in NLP explores the methods and challenges associated
with analyzing English word structures. Key aspects include:

1. Inflection:
o Description: Modifications to words to express grammatical features like tense,
number, or case.
o Example: "play" (base form) → "played" (past tense), "cat" (singular) → "cats"
(plural).
2. Derivation:
o Description: Creating new words by adding prefixes or suffixes to base words.
o Example: "happy" (base form) → "unhappy" (with prefix) or "happiness" (with
suffix).
3. Compounding:
o Description: Combining two or more words to form a new word with a specific
meaning.
o Example: "tooth" + "brush" → "toothbrush".
4. Morphological Analysis Tools:
o Tokenizers: Break text into words and morphemes.
o Stemmers: Reduce words to their base form (e.g., "running" → "run").
o Lemmatizers: Convert words to their dictionary form based on context (e.g.,
"running" → "run").
5. Challenges:
o Irregular Forms: English has many exceptions and irregular forms that do not follow
standard patterns (e.g., "go" → "went").
o Ambiguity: Words can have multiple meanings or forms depending on context (e.g.,
"bark" as a tree's outer layer vs. a dog's sound).
6. Applications:
o Search Engines: Improve search accuracy by understanding word variations.
o Text Analysis: Enhance understanding and processing of text in applications like
sentiment analysis and machine translation.

10. Explain regular expressions with their types?

Regular Expressions in NLP

Regular Expressions (Regex) are patterns used to match sequences of characters in text.
They provide a powerful way to search, match, and manipulate text based on specified
patterns. In NLP, regex is used for tasks such as text preprocessing, tokenization, and data
extraction.

Basic Components

1. Literal Characters: Match exact characters.
o Example: hello matches "hello".
2. Metacharacters: Special characters that define patterns.
o Examples: ., ^, $, *, +, ?, |, [], (), \

Types of Regular Expressions

1. Basic Character Matching
o Literal Matching: Matches specific characters.
▪ Example: cat matches "cat".
o Character Classes: Matches any one of a set of characters.
▪ Example: [abc] matches "a", "b", or "c".
2. Quantifiers
o Asterisk *: Matches zero or more occurrences.
▪ Example: a* matches "", "a", "aa", etc.
o Plus +: Matches one or more occurrences.
▪ Example: a+ matches "a", "aa", "aaa", etc.
o Question Mark ?: Matches zero or one occurrence.
▪ Example: a? matches "" or "a".
3. Anchors
o Caret ^: Matches the start of a string.
▪ Example: ^cat matches "cat" at the beginning.
o Dollar $: Matches the end of a string.
▪ Example: cat$ matches "cat" at the end.
4. Groups and Ranges
o Parentheses (): Groups parts of a pattern.
▪ Example: (abc)+ matches "abc", "abcabc", etc.
o Brackets []: Defines a set of characters.
▪ Example: [a-z] matches any lowercase letter.
5. Special Characters
o Dot .: Matches any single character except newline.
▪ Example: a.c matches "abc", "a-c", etc.
o Backslash \: Escapes special characters or denotes special sequences.
▪ Example: \d matches any digit (0-9).
6. Alternation
o Pipe |: Matches either the pattern before or after the pipe.
▪ Example: cat|dog matches "cat" or "dog".
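
A short sketch exercising several of these pattern types with Python's re module (the sample text is an assumption):

```python
import re

text = "The cat and the dog sat; call 123-4567."

print(re.findall(r"cat|dog", text))                  # alternation -> ['cat', 'dog']
print(re.findall(r"\d+", text))                      # \d + quantifier -> ['123', '4567']
print(re.findall(r"[aeiou]", text)[:5])              # character class -> first 5 vowels
print(bool(re.match(r"^The", text)))                 # anchor ^ -> True
print(re.search(r"(\d{3})-(\d{4})", text).groups())  # groups -> ('123', '4567')
```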

11. Write short notes on a) Finite automata b) N-gram model c) Finite transducer

a) Finite Automata

Finite Automata are mathematical models used to represent and work with regular
languages. They are used in NLP to recognize patterns and process text based on specific
rules.

• Definition: A finite automaton consists of states, transitions between states, an initial
state, and one or more accepting states. It reads input symbols and transitions through
states according to predefined rules.
• Types:
o Deterministic Finite Automaton (DFA): For each state and input symbol, there is
exactly one transition.
o Nondeterministic Finite Automaton (NFA): Allows multiple transitions for a state
and input symbol, including epsilon (empty string) transitions.
• Applications in NLP:
o Pattern Matching: Used for tasks like searching and matching patterns in text.
o Tokenization: Helps in identifying and segmenting words or tokens in text.
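
A small Python sketch of a DFA that accepts strings over {a, b} ending in 'b' (the states and alphabet are illustrative assumptions):

```python
# DFA sketch: state q1 is accepting; exactly one transition per (state, symbol).
TRANSITIONS = {("q0", "a"): "q0", ("q0", "b"): "q1",
               ("q1", "a"): "q0", ("q1", "b"): "q1"}
ACCEPTING = {"q1"}

def dfa_accepts(string, start="q0"):
    state = start
    for symbol in string:
        state = TRANSITIONS[(state, symbol)]  # deterministic step
    return state in ACCEPTING

print(dfa_accepts("aab"))  # True  (ends in 'b')
print(dfa_accepts("aba"))  # False (ends in 'a')
```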

b) N-gram Model

N-gram Model is a statistical language model used to predict the next word in a sequence
based on the preceding words.

• Definition: An N-gram model predicts the probability of a word based on the
previous N-1 words. It uses sequences of N words to build probabilistic models of
text.
• Types:
o Unigram: Considers individual words (N=1).
o Bigram: Considers pairs of consecutive words (N=2).
o Trigram: Considers triples of consecutive words (N=3), and so on.
• Applications in NLP:
o Text Prediction: Predicts the next word in text input, improving user typing
experience.
o Speech Recognition: Enhances accuracy by predicting the most likely word
sequences.
o Text Generation: Generates coherent text sequences based on learned patterns.
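
A bigram model can be built in a few lines of Python by counting over a corpus; a sketch with a toy corpus (the text is an assumption):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, nxt):
    # maximum-likelihood estimate: count(prev nxt) / count(prev)
    return bigram_counts[(prev, nxt)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))  # 2/3: "the" occurs 3 times, followed by "cat" twice
```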

c) Finite Transducer

Finite Transducer is a computational model used to transform input sequences into output
sequences. It extends the concept of finite automata by incorporating output symbols.

• Definition: A finite transducer consists of states, transitions, an initial state, and
output functions that map input sequences to output sequences. It can produce an
output based on the input it processes.
• Types:
o Deterministic Finite Transducer (DFT): For each state and input, there is exactly one
transition with a corresponding output.
o Nondeterministic Finite Transducer (NFT): Allows multiple transitions with possible
outputs for each state and input.
• Applications in NLP:
o Text Processing: Used in tasks like morphological analysis, where input words are
transformed into their root forms.
o Machine Translation: Translates text from one language to another by mapping
input text to target language output.
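
A toy sketch of a finite transducer: each transition carries an output symbol as well as a next state (the single-state swap of 'a' and 'b' is an illustrative assumption):

```python
# Each transition maps (state, input symbol) -> (next state, output symbol).
TRANSITIONS = {("q0", "a"): ("q0", "b"),
               ("q0", "b"): ("q0", "a")}

def transduce(string, start="q0"):
    state, output = start, []
    for symbol in string:
        state, out_symbol = TRANSITIONS[(state, symbol)]
        output.append(out_symbol)
    return "".join(output)

print(transduce("abba"))  # 'baab'
```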

12. How is the n-gram model used for spelling correction? Also explain the variations of the
N-gram model

• Spelling Correction:

• Detection: Identifying misspelled words.
• Correction: Suggesting the correct words for the identified errors.

• Types of Spelling Errors:

• Phonetic Errors: Errors where the misspelled word sounds similar to the correct
word.
• Non-Word Errors: Errors that result in sequences not found in the lexicon or
orthographic form.
• Real-Word Errors: Errors that produce valid words but are incorrect due to
typographical mistakes or other errors.

• Error Detection and Correction:

• Non-Word Errors:
o Result in invalid strings of characters.
o Example: Trigrams like 'qst' or bigrams like 'qd' are not valid English
combinations.
• Real-Word Errors:
o Result in actual words but incorrect in context.
o Often caused by typos or common spelling mistakes.

• N-gram Model for Error Detection:

• Usage: Helps detect both non-word and real-word errors.
• Method: Analyzes letter combination frequencies in a large corpus to identify
unlikely or rare sequences.
• Example: Identifying non-word errors using rare bigrams or trigrams that do not
frequently occur in English.
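
A rough sketch of this detection idea: flag a word if it contains a letter bigram never seen in a reference corpus (the tiny word list is an assumption; real systems use large corpora and frequency thresholds):

```python
from collections import Counter

reference_words = ["question", "quick", "squad", "request", "quest"]
seen_bigrams = Counter(pair for w in reference_words for pair in zip(w, w[1:]))

def suspicious_bigrams(word):
    # return any letter pair in the word that the corpus has never seen
    return [a + b for a, b in zip(word, word[1:]) if (a, b) not in seen_bigrams]

print(suspicious_bigrams("qstion"))    # ['qs'] -- likely a non-word error
print(suspicious_bigrams("question"))  # []    -- all bigrams attested
```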

• Training Data for N-gram Models:

• Requirement: Requires a large corpus or dictionary.
• Purpose: Helps compile an N-gram table of possible letter combinations for accurate
detection and correction.

• N-gram Probability Calculation:

• Chain Rule Formula: P(w1, w2, ..., wn) = P(w1) × P(w2 | w1) × P(w3 | w1, w2) × ... ×
P(wn | w1, ..., wn-1). An N-gram model approximates each conditional term using only
the previous N-1 words: P(wi | w1, ..., wi-1) ≈ P(wi | wi-N+1, ..., wi-1).
• Function: Determines the probability of a word sequence based on the probabilities
of preceding words.

Variations of N-gram Models

1. Unigram Model:
o Description: Considers single words independently, without regard to their context.
o Application: Useful for simple frequency counts and basic predictions.
o Limitation: Lacks contextual information, which limits its predictive power.
2. Bigram Model:
o Description: Considers pairs of consecutive words.
o Application: Captures some context and dependencies between words, making it
more effective than a unigram model.
o Example: Predicts the next word based on the preceding word.
o Limitation: Still limited in capturing longer context dependencies.
3. Trigram Model:
o Description: Considers triples of consecutive words.
o Application: Provides a better understanding of context by incorporating more
previous words.
o Example: Predicts the next word based on the previous two words.
o Limitation: Requires more data and memory, and may not handle very long contexts
well.

13. Explain N-gram Sensitivity to the Training Corpus

The performance of an N-gram Model depends a lot on the quality and size of the training
data. Here’s a simplified explanation:

1. Coverage:
o Large Corpus: Includes many word sequences, making the model better at
predicting and understanding language.
o Small Corpus: Might miss many sequences, leading to less accurate
predictions.
2. Handling Unseen Sequences:
o Small Corpus: Might not include all possible sequences, causing the model to
struggle with new or rare combinations.
o Solution: Use smoothing techniques to estimate probabilities for unseen
sequences (a sketch follows after this list).
3. Prediction Accuracy:
o Well-Represented Corpus: Provides better predictions by covering a wide
range of contexts and word sequences.
o Underrepresented Corpus: Results in less accurate predictions due to
missing context.
4. Model Complexity:
o Large Corpus: Supports more complex models (like 4-grams or 5-grams) that
understand longer sequences.
o Small Corpus: Limits the model to simpler forms (like unigrams or bigrams),
capturing less context.
5. Domain Adaptation:
o Domain-Specific Corpus: Helps the model understand specific terms and
contexts (e.g., medical or legal terms).
o General Corpus: May not be as effective for specialized fields.
6. Frequency and Distribution:
o Balanced Corpus: Includes a mix of common and rare sequences, making the
model more robust.
o Biased Corpus: Skewed towards certain topics, which might limit the model’s
performance in other areas.
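
As referenced in point 2 above, smoothing assigns unseen sequences a small nonzero probability. A sketch of add-one (Laplace) smoothing over a toy corpus (real systems use more refined schemes such as Kneser-Ney):

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
vocab_size = len(set(corpus))                 # V = 5 distinct words here
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def laplace_bigram_prob(prev, nxt):
    # add-one smoothing: (count(prev nxt) + 1) / (count(prev) + V)
    return (bigrams[(prev, nxt)] + 1) / (unigrams[prev] + vocab_size)

print(laplace_bigram_prob("the", "cat"))  # seen bigram:   (1+1)/(2+5)
print(laplace_bigram_prob("the", "sat"))  # unseen bigram: (0+1)/(2+5), still nonzero
```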
