Lemmatization is the grouping together of the different forms of the same word.
15) Does the vocabulary of a corpus remain the same before and after text
normalization? Why?
Ans) No, the vocabulary of a corpus does not remain the same before and after text
normalization. Reasons are:
● In normalization the text is reduced to a minimum vocabulary, since the machine does
not require grammatically complete statements but only their essence.
● In normalization, stop words, special characters and numbers are removed.
● In stemming, the affixes of words are removed and the words are converted to their
base form.
So, after normalization, we get a reduced vocabulary.
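A minimal sketch of the idea (the tiny corpus and the hand-picked stop-word list below are illustrative assumptions, and NLTK's PorterStemmer is assumed to be installed):

    from nltk.stem import PorterStemmer

    corpus = "Raj likes to play football. Playing football is what Raj likes."
    stop_words = {"to", "is", "what"}          # tiny illustrative stop-word list
    stemmer = PorterStemmer()

    raw_tokens = corpus.replace(".", "").split()
    normalized = [stemmer.stem(t.lower()) for t in raw_tokens
                  if t.lower() not in stop_words]

    print(len(set(raw_tokens)))    # vocabulary size before normalization
    print(len(set(normalized)))    # smaller vocabulary size after normalization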
16) What is the significance of converting the text into a common case?
Ans) In Text Normalization, we undergo several steps to normalize the text to a lower
level. After the removal of stop words, we convert the whole text into a similar case,
preferably lower case. This ensures that the machine's case-sensitivity does not make it
treat the same words as different just because they appear in different cases.
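A quick sketch of why this matters (plain Python, no extra libraries assumed):

    tokens = ["Football", "football", "FOOTBALL"]
    print(len(set(tokens)))                     # 3 distinct entries without lowercasing
    print(len({t.lower() for t in tokens}))     # 1 distinct entry after lowercasing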
22) Classify each of the images according to how well the model’s output matches the
data samples:
Ans)
Here, the red dashed line is the model's output while the blue crosses are the actual
data samples.
● In the first case, the model's output does not match the true function at all. Hence the
model is said to be underfitting and its accuracy is low.
● In the second case, the model tries to cover all the data samples, even those that do not
align with the true function. This model is said to be overfitting, and it too has low
accuracy.
● In the third case, the model's output matches the true function well, which means the
model has optimum accuracy; such a model is called a perfect fit.
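A minimal sketch of the three situations (NumPy assumed; the noisy sine curve and the chosen polynomial degrees are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 20)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.shape)   # noisy samples of a true function

    for degree, label in [(1, "underfitting"), (3, "good fit"), (15, "overfitting")]:
        coeffs = np.polyfit(x, y, degree)       # the model's output
        y_hat = np.polyval(coeffs, x)
        mse = np.mean((y - y_hat) ** 2)         # error on the training samples only
        print(f"degree {degree:2d} ({label}): training MSE = {mse:.4f}")

The degree-1 line misses the shape of the data (underfitting), while the degree-15 curve chases the noise: its training error is tiny, but it would generalize poorly to new samples (overfitting).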
23) Explain how AI can play a role in sentiment analysis of human beings?
Ans) The goal of sentiment analysis is to identify sentiment among several posts or even
in the same post where emotion is not always explicitly expressed. Companies use
Natural Language Processing applications, such as sentiment analysis, to identify
opinions and sentiment online to help them understand what customers think about
their products and services (i.e., “I love the new iPhone” and, a few lines later “But
sometimes it doesn’t work well” where the person is still talking about the iPhone) and
overall indicators of their reputation. Beyond determining simple polarity, sentiment
analysis understands sentiment in context to help better understand what is behind an
expressed opinion, which can be extremely relevant in understanding and driving
purchasing decisions.
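A minimal sketch of a polarity check using NLTK's VADER analyzer (VADER is just one readily available option, not necessarily what any particular company uses; it assumes the vader_lexicon data has been fetched with nltk.download("vader_lexicon")):

    from nltk.sentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    reviews = [
        "I love the new iPhone",
        "But sometimes it doesn't work well",
    ]
    for text in reviews:
        scores = analyzer.polarity_scores(text)    # neg / neu / pos / compound scores
        print(text, "->", scores["compound"])      # compound > 0 is positive, < 0 is negative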
24) Why are human languages complicated for a computer to understand? Explain.
Ans) Communication between machines is very basic and simple, whereas human
communication is complex. There are multiple characteristics of human language that
may be easy for a human to understand but are extremely difficult for a computer,
which is why machines find it hard to understand our language.
Arrangement of the words and meaning - There are rules in human language: there are
nouns, verbs, adverbs and adjectives, and a word can be a noun at one time and an
adjective at another. This creates difficulty for a computer while processing the text.
Analogy with programming languages -
Different syntax, same semantics: 2+3 = 3+2. Here the way these statements are written
is different, but their meanings are the same, that is, 5.
Different semantics, same syntax: 3/2 (Python 2.7) ≠ 3/2 (Python 3). Here the
statements have the same syntax but their meanings are different: in Python 2.7 this
statement results in 1, while in Python 3 it gives 1.5 (see the sketch after this answer).
Perfect Syntax, no Meaning - Sometimes, a statement can have a perfectly correct syntax
yet mean nothing; for example, "Chickens feed extravagantly while the moon drinks tea"
is grammatically fine but makes no sense. In human language, a perfect balance of
syntax and semantics is important for better understanding. These are some of the
challenges we face if we try to teach computers to understand and interact in human
language.
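A small sketch of the programming-language analogy above (the comments show the results under Python 3; under Python 2.7 the first line would print 1 instead of 1.5):

    print(3 / 2)            # 1.5 in Python 3, but 1 in Python 2.7: same syntax, different semantics
    print(2 + 3 == 3 + 2)   # True: different syntax (order), same meaning, both equal 5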
25) What are the steps of text Normalization? Explain them in brief
Ans) In Text Normalization, we undergo several steps to normalize the text to a lower
level. The steps are:
Sentence Segmentation - Under sentence segmentation, the whole corpus is divided into
sentences. Each sentence is taken as a separate piece of data, so the whole corpus gets
reduced to sentences.
Tokenisation - After segmenting the sentences, each sentence is further divided into
tokens. A token is a term used for any word, number or special character occurring in a
sentence. Under tokenisation, every word, number and special character is considered
separately, and each of them becomes a separate token.
Removing Stop Words, Special Characters and Numbers - In this step, the tokens which
are not necessary are removed from the token list.
Converting text to a common case - After the stop words are removed, we convert the
whole text into a similar case, preferably lower case. This ensures that the machine's
case-sensitivity does not make it treat the same words as different just because they
appear in different cases.
Stemming - In this step, the remaining words are reduced to their root words. In other
words, stemming is the process in which the affixes of words are removed and the
words are converted to their base form.
Lemmatization - In lemmatization, the word we get after affix removal (also known as
the lemma) is a meaningful one. With this, we have normalized our text to tokens, which
are the simplest forms of the words present in the corpus. Now it is time to convert the
tokens into numbers. For this, we would use the Bag of Words algorithm.
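A minimal sketch of these steps with NLTK (it assumes the punkt, stopwords and wordnet data packages have been downloaded via nltk.download; the sample corpus is illustrative):

    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    corpus = "Raj likes to play football. Vijay prefers playing online games."

    sentences = sent_tokenize(corpus)                            # 1. sentence segmentation
    tokens = [t for s in sentences for t in word_tokenize(s)]    # 2. tokenisation
    stop = set(stopwords.words("english"))
    tokens = [t.lower() for t in tokens                          # 4. convert to a common (lower) case
              if t.isalnum() and t.lower() not in stop]          # 3. remove stop words / special characters
    stems = [PorterStemmer().stem(t) for t in tokens]            # 5. stemming
    lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # 6. lemmatization
    print(stems)
    print(lemmas)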
26) Normalize the given text and comment on the vocabulary before and after the
normalization: Raj and Vijay are best friends. They play together with other friends. Raj
likes to play football but Vijay prefers to play online games. Raj wants to be a footballer.
Vijay wants to become an online gamer.
Ans) Normalization of the given text: Sentence Segmentation:
1. Raj and Vijay are best friends.
2. They play together with other friends.
3. Raj likes to play football but Vijay prefers to play online games.
4. Raj wants to be a footballer.
5. Vijay wants to become an online gamer.
Tokenisation:
‘Raj’, ‘and’, ‘Vijay’, ‘are’, ‘best’, ‘friends’, ‘.’ (the same is done for all the sentences)
Removing Stop Words, Special Characters and Numbers:
The tokens which are not necessary, such as the stop words and the full stops, are
removed from the token list.
Converting text to a common case:
Here we do not have words in different cases, so this step is not required for the given text.
Stemming:
In this step, the remaining words are reduced to their root words. In other words,
stemming is the process in which the affixes of words are removed and the words are
converted to their base form.
Given Text
Raj and Vijay are best friends. They play together with other friends. Raj likes to play
football but Vijay prefers to play online games. Raj wants to be a footballer.
Vijay wants to become an online gamer.
Normalized Text
Raj and Vijay best friends. They play together with other friends Raj likes to play
football but Vijay prefers to play online games Raj wants to be a footballer Vijay wants
to become an online gamer
Comment on the vocabulary: the vocabulary does not remain the same after
normalization. The stop words and special characters are removed and the words are
reduced to their base forms, so the vocabulary of the normalized text is smaller than
that of the given text.
28) Which words in a corpus have the highest values and which ones have the least?
OR
Explain the relation between occurrence and value of a word
Ans) Stop words like ‘and’, ‘this’, ‘is’, ‘the’, etc. occur the most in a corpus but have the
least value, since they do not tell us anything about the corpus at all. Hence, these are
termed stop words and are mostly removed at the pre-processing stage itself. Rare or
valuable words occur the least but add the most value and importance to the corpus.
Hence, when we look at the text, we take the frequent and the rare words into
consideration.
In short, as the occurrence of a word drops, its value rises; the words that occur the
least (the rare or valuable words) add the most value to the corpus.
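A minimal sketch of the occurrence side of this relation (the sample sentence is illustrative; only the Python standard library is assumed):

    from collections import Counter

    text = ("the match was played in the city and the crowd cheered as "
            "the striker scored the winning goal")
    counts = Counter(text.split())
    print(counts.most_common(1))      # the most frequent token is the stop word 'the'
    print(counts.most_common()[-3:])  # the rarest tokens (content words) carry most of the meaning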
30) What are stop words? Explain with the help of examples.
Ans) “Stop words” are the most common words in a language like “the”, “a”, “on”, “is”,
“all”. These words do not carry important meaning and are usually removed from texts.
It is possible to remove stop words using Natural Language Toolkit (NLTK), a suite of
libraries and programs for symbolic and statistical natural language processing.
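A minimal sketch of removing stop words with NLTK (assumes the stopwords corpus has been downloaded with nltk.download("stopwords")):

    from nltk.corpus import stopwords

    stop = set(stopwords.words("english"))
    tokens = ["the", "new", "phone", "is", "on", "a", "table"]
    print([t for t in tokens if t not in stop])   # ['new', 'phone', 'table']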
31) Through a step-by-step process, calculate TFIDF for the given corpus and
mention the word(s) having highest value.
Document 1: We are going to Mumbai
Document 2: Mumbai is a famous place.
Document 3: We are going to a famous place.
Document 4: I am famous in Mumbai.
Ans) Term Frequency: Term frequency is the frequency of a word in one document.
Term frequency can easily be found from the document vector table, as in that table we
mention the frequency of each word of the vocabulary in each document. Here every
word occurs at most once per document, so each word has a term frequency of 1 in
every document in which it appears.
Inverse Document Frequency: This is the other half of TFIDF. For this, let us first
understand what document frequency means. Document Frequency is the number of
documents in which the word occurs, irrespective of how many times it has occurred in
those documents. For the given corpus the document frequencies are:
We: 2, are: 2, going: 2, to: 2, Mumbai: 3, is: 1, a: 2, famous: 3, place: 2, I: 1, am: 1, in: 1
Talking about inverse document frequency, we put the document frequency in the
denominator while the total number of documents goes in the numerator. Here, the
total number of documents is 4, so the inverse document frequency of a word W is
4/DF(W), and TFIDF(W) = TF(W) × log(4/DF(W)).
The words ‘is’, ‘I’, ‘am’ and ‘in’ occur in only one document each, so they have the
highest value: TFIDF = 1 × log(4/1) ≈ 0.602 in the document where each of them
appears. Words such as ‘Mumbai’ and ‘famous’ (document frequency 3) get the lowest
non-zero value, 1 × log(4/3) ≈ 0.125.
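A small sketch that carries out the same calculation in Python (standard library only; it uses TFIDF(W) = TF(W) × log10(N/DF(W)) with N = 4, matching the formula above):

    import math

    docs = [
        "We are going to Mumbai",
        "Mumbai is a famous place.",
        "We are going to a famous place.",
        "I am famous in Mumbai.",
    ]
    tokenized = [d.lower().replace(".", "").split() for d in docs]
    vocab = sorted({w for d in tokenized for w in d})
    N = len(docs)

    for w in vocab:
        df = sum(w in d for d in tokenized)            # document frequency of the word
        for i, d in enumerate(tokenized, start=1):
            tf = d.count(w)                            # term frequency in this document
            if tf:
                print(f"{w!r} in document {i}: tfidf = {tf * math.log10(N / df):.3f}")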