NLP Key Points


What is a Chatbot?

A chatbot is a computer program that can learn over time how to best interact with
humans. It can answer questions, troubleshoot customer problems, evaluate and
qualify prospects, generate sales leads, and increase sales on an e-commerce site.

While working with NLP, what is the meaning of:
a. Syntax
b. Semantics


Syntax: Syntax refers to the grammatical structure of a sentence.
Semantics: Semantics refers to the meaning of the sentence.

What is the difference between stemming and lemmatization?


Stemming is a technique used to extract the base form of words by removing
affixes from them. It is just like cutting down the branches of a tree to its stem. For
example, the stem of the words eating, eats, eaten is eat. The stem produced may not
itself be a meaningful word. Lemmatization is the grouping together of the different
forms of the same word, reducing each to its lemma, a meaningful dictionary form. In
search queries, lemmatization allows end users to query any version of a base word
and get relevant results.
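
A minimal sketch of the two techniques, assuming NLTK is installed and the
"wordnet" data has been downloaded via nltk.download("wordnet"):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["eating", "eats", "eaten"]:
    # the stemmer chops affixes; the lemmatizer maps to a dictionary form
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
```

Note that the stemmer may leave forms like "eaten" unchanged, while the
lemmatizer (given the verb part of speech) maps all three to "eat".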

What is meant by a dictionary in NLP?


A dictionary in NLP is a list of all the unique words occurring in the corpus. Even if a
word is repeated in different documents, it is written only once while creating the
dictionary.
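
A minimal sketch of building such a dictionary; the two documents below are made
up for illustration:

```python
documents = [
    "we all are going on a holiday",
    "we are going to enjoy the holiday",
]

# each unique word is written only once, even if repeated across documents
dictionary = sorted({word for doc in documents for word in doc.split()})
print(dictionary)
```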

What is term frequency?


Term frequency is the frequency of a word in one document. It can easily be read
from the document vector table, since that table records the frequency of each word
of the vocabulary in each document.
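
A minimal sketch, using an illustrative sentence:

```python
document = "the cat sat on the mat"
words = document.split()
print(words.count("the"))  # "the" occurs twice, so its term frequency is 2
```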

What is a document vector table?


A document vector table is used while implementing the Bag of Words algorithm. In a
document vector table, the header row contains the vocabulary of the corpus and the
other rows correspond to the different documents. If a document contains a particular
word, it is represented by 1; the absence of the word is represented by 0.
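
A minimal sketch of such a presence/absence table; the documents are illustrative:

```python
documents = [
    "the cat sat on the mat",
    "the dog sat",
]

vocabulary = sorted({w for doc in documents for w in doc.split()})
print(vocabulary)  # header row: the vocabulary of the corpus

for doc in documents:
    present = set(doc.split())
    print([1 if term in present else 0 for term in vocabulary])
```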

What do you mean by corpus?


A corpus is a large and structured set of machine-readable texts that have been
produced in a natural communicative setting.

What do you mean by document vectors?


A document vector contains the frequency of each word of the vocabulary in a
particular document. In a document vector table, the vocabulary is written in the top
row. Then, for each word in the document, if it matches a word in the vocabulary, put
a 1 under it; if the same word appears again, increment the previous value by 1; and
if a vocabulary word does not occur in that document, put a 0 under it.
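
A minimal sketch of building document vectors that store word counts; the documents
are illustrative:

```python
documents = [
    "the cat sat on the mat",
    "the dog sat",
]

vocabulary = sorted({w for doc in documents for w in doc.split()})

for doc in documents:
    words = doc.split()
    # count how often each vocabulary word occurs in this document
    print([words.count(term) for term in vocabulary])
```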

What is TFIDF? Write its formula.


Term frequency–inverse document frequency (TFIDF) is a numerical statistic that is
intended to reflect how important a word is to a document in a collection or corpus.
Term frequency is the number of times a word appears in a document divided by the
total number of words in the document; every document has its own term frequency.
Inverse document frequency measures how rare the word is across the documents of
the corpus. The formula is:

TFIDF(W) = TF(W) × log(N / DF(W))

where TF(W) is the term frequency of word W in the document, N is the total number
of documents in the corpus, and DF(W) is the number of documents in which W occurs.
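
A minimal sketch of the formula above computed by hand; the documents are
illustrative:

```python
import math

documents = [
    "the cat sat on the mat",
    "the dog sat",
    "dogs and cats play",
]

def tfidf(word, doc, docs):
    words = doc.split()
    tf = words.count(word) / len(words)              # term frequency
    df = sum(1 for d in docs if word in d.split())   # document frequency
    return tf * math.log(len(docs) / df)             # TF x IDF

print(tfidf("cat", documents[0], documents))  # rarer word -> higher score
print(tfidf("sat", documents[0], documents))  # common word -> lower score
```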

Explain the concept of Bag of Words.


Bag of Words is a Natural Language Processing model which helps in extracting
features out of text that can then be used in machine learning algorithms. In Bag of
Words, we count the occurrences of each word and construct the vocabulary for the
corpus. Bag of Words then creates a set of vectors containing the count of word
occurrences in each document. Bag of Words vectors are easy to interpret.
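
A minimal sketch using scikit-learn's CountVectorizer, assuming a recent version of
scikit-learn is installed:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog sat"]

vectorizer = CountVectorizer()
bag = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bag.toarray())                       # one count vector per document
```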
What are the steps of text normalization? Explain them in brief.


In Text Normalization, we go through several steps to reduce the text to a simpler,
more uniform level. A code sketch of the full pipeline follows the steps below.

Sentence Segmentation - Under sentence segmentation, the whole corpus is divided
into sentences. Each sentence is treated as a separate piece of data, so the whole
corpus is reduced to a list of sentences.

Tokenization - After segmenting the sentences, each sentence is further divided
into tokens. Token is a term used for any word, number, or special character
occurring in a sentence. Under tokenization, every word, number, and special
character is considered separately, and each of them becomes a separate token.

Removing Stop Words, Special Characters and Numbers - In this step, the tokens
which are not necessary, such as stop words (frequently occurring words like "and",
"the", or "to" that carry little meaning), special characters, and numbers, are removed
from the token list.

Converting text to a common case - After stop word removal, we convert the
whole text into a single case, preferably lower case. This ensures that the machine's
case-sensitivity does not treat the same words as different just because of differing
cases.

Stemming - In this step, the remaining words are reduced to their root words. In other
words, stemming is the process in which the affixes of words are removed and the
words are converted to their base form.

Lemmatization - In lemmatization, the word we get after affix removal (also known as
the lemma) is a meaningful one. With this, we have normalized our text to tokens,
which are the simplest form of the words present in the corpus. Now it is time to
convert the tokens into numbers. For this, we use the Bag of Words algorithm.
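
A minimal sketch of the whole pipeline with NLTK, assuming nltk is installed and the
punkt, stopwords, and wordnet data packages have been downloaded:

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

corpus = "NLP is fun. Machines can process many sentences quickly!"

sentences = nltk.sent_tokenize(corpus)                          # sentence segmentation
tokens = [t for s in sentences for t in nltk.word_tokenize(s)]  # tokenization

stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens                                # remove stop words,
          if t.lower() not in stop_words                   # special characters
          and t not in string.punctuation                  # and numbers
          and not t.isdigit()]

tokens = [t.lower() for t in tokens]                       # common (lower) case

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])           # reduce to lemmas
```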
