
NATURAL LANGUAGE PROCESSING

Computers can understand structured forms of data such as spreadsheets and database tables, but human languages, texts, and voices form an unstructured category of data that is difficult for a computer to understand. This is where Natural Language Processing comes in.
Natural Language Processing, or NLP, is the sub-field of AI that is focused on enabling computers to understand and process human languages. It draws on Linguistics, Computer Science, Information Engineering, and Artificial Intelligence, and is concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Solving a complex problem in Machine Learning means building a pipeline. In simple terms, it means breaking a complex problem into a number of small problems, making a model for each of them and then integrating these models. A similar thing is done in NLP: we can break down the process of understanding English for a model into a number of small pieces.
A usual interaction between machines and humans using Natural Language
Processing could go as follows:

1. Humans talk to the computer

2. The computer captures the audio

3. There is an audio to text conversion

4. Text data is processed

5. Data is converted to audio

6. The computer plays the audio file and responds to humans

Applications of Natural Language Processing


1. Chatbots
Chatbots are a form of artificial intelligence programmed to interact with humans in such a way that they sound like humans themselves. Depending on their complexity, chatbots can either just respond to specific keywords or hold full conversations that make it tough to distinguish them from humans. Chatbots are built using Natural Language Processing and Machine Learning: they understand the complexities of the language, work out the actual meaning of a sentence, and also learn from their conversations with humans, becoming better over time. Chatbots work in two simple steps. First, they identify the meaning of the question asked and collect all the data from the user that may be required to answer it. Then they answer the question appropriately.

2. Autocomplete in Search Engines

Have you noticed that search engines tend to guess what you are typing and automatically complete your sentences? For example, on typing “game” in Google, you may get further suggestions for “game of thrones”, “game of life” or, if you are interested in maths, “game theory”. All these suggestions are provided using autocomplete, which uses Natural Language Processing to guess what you want to ask. Search engines use their enormous data sets to analyze what their customers are probably typing when they enter particular words and suggest the most common possibilities. They use Natural Language Processing to make sense of these words and how they are interconnected to form different sentences.

3. Voice Assistants

These days voice assistants are all the rage! Whether it's Siri, Alexa, or Google Assistant, almost everyone uses one of these to make calls, set reminders, schedule meetings, set alarms, surf the internet, etc.
These voice assistants have made life much easier. But how do they work? They use a complex combination of speech recognition, natural language understanding, and natural language processing to understand what humans are saying and then act on it. The long-term goal of voice assistants is to become a bridge between humans and the internet and provide all manner of services based on voice interaction alone. However, they are still a little far from that goal, seeing as Siri still can't understand what you are saying sometimes!

4. Language Translator

Want to translate a text from English to Hindi but don't know Hindi? Well, Google Translate is the tool for you! While it's not exactly 100% accurate, it is still a great tool to convert text from one language to another. Google Translate and other translation tools use sequence-to-sequence modelling, a technique in Natural Language Processing. It allows the algorithm to convert a sequence of words from one language to another, which is what translation is. Earlier, language translators used Statistical Machine Translation (SMT), which meant they analyzed millions of documents that were already translated from one language to another (English to Hindi in this case) and then looked for common patterns and the basic vocabulary of the language. However, this method was not as accurate as sequence-to-sequence modelling.

5. Sentiment Analysis

Almost the whole world is on social media these days! And companies can use sentiment analysis to understand how a particular type of user feels about a particular topic, product, etc. They can use natural language processing, computational linguistics, text analysis, etc. to understand the general sentiment of users towards their products and services and find out whether that sentiment is good, bad, or neutral. Companies can use sentiment analysis in many ways, such as to find out the emotions of their target audience, to understand product reviews, or to gauge their brand sentiment. And not just private companies: even governments use sentiment analysis to gauge popular opinion and to catch any threats to national security.

6. Grammar Checkers

Grammar and spelling are very important when writing professional reports for your superiors, and even assignments for your lecturers. After all, having major errors may get you fired or failed! That's why grammar and spell checkers are a very important tool for any professional writer. They can not only correct grammar and check spellings but also suggest better synonyms and improve the overall readability of your content. And guess what, they utilize natural language processing to provide the best possible piece of writing! The NLP algorithm is trained on millions of sentences to understand the correct format. That is why it can suggest the correct verb tense, a better synonym, or a clearer sentence structure than what you have written.

Some of the most popular grammar checkers that use NLP include Grammarly,
WhiteSmoke, ProWritingAid, etc.

7. Email Classification and Filtering

Email is still the most important method of professional communication. However, all of us still get thousands of promotional emails that we don't want to read. Thankfully, our emails are automatically divided into three sections, namely Primary, Social and Promotions, which means we never have to open the Promotions section! But how does this work? Email services use natural language processing to identify the contents of each email with text classification so that it can be put in the correct section. This method is not perfect, since some promotional newsletters still land in Primary, but it's better than nothing. In more advanced cases, some companies also use specialty anti-virus software with natural language processing to scan emails and look for patterns and phrases that may indicate a phishing attempt on the employees.

8. Text Summarization

Text summarization is the process of creating a shorter version of a text that contains only the vital information, thus helping the user understand the text in a shorter amount of time.

The main advantage of text summarization lies in the fact that it reduces the user's time spent searching for important details in the document.
There are two main approaches to summarizing text documents –

1. Extractive Method: It involves selecting phrases and sentences from the original text and including them in the final summary (a minimal sketch follows these examples).

Example:

Original Text: Python is a high-level, interpreted, interactive, and object-oriented scripting language. Python is a great language for beginner-level programmers.

Extractive Summary: Python is a high-level scripting language, a great language for beginner-level programmers.

2. Abstractive Method: It involves generating entirely new phrases and sentences to capture the meaning of the source document.

Example:

Original Text: Python is a high-level, interpreted, interactive, and object-oriented scripting language. Python is a great language for beginner-level programmers.

Abstractive Summary: Python is an interpreted and interactive language and it is easy to learn.
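
To make the extractive approach concrete, here is a minimal, hypothetical Python sketch of a frequency-based extractive summariser; the summarize helper and its scoring scheme are illustrative assumptions, not a standard library function:

import re
from collections import Counter

def summarize(text, max_sentences=1):
    # Toy extractive summariser: score each sentence by the average
    # frequency of its words and keep the highest-scoring sentences.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)
    chosen = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    # Return the selected sentences in their original order
    return ' '.join(s for s in sentences if s in chosen)

text = ("Python is a high-level, interpreted, interactive, and object-oriented "
        "scripting language. Python is a great language for beginner-level programmers.")
print(summarize(text))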

9. Text Classification
Text is a form of unstructured data that carries very rich information within it. Text classifiers categorize and organize practically any kind of textual content we work with today. Text classification makes it possible to assign predefined categories to a document and organize it to help you find the information you need or simplify some activities. For example, one application of text classification is spam filtering in email. A key question for any business is how its products are being received by their intended buyers, and text classification answers such business questions by classifying people's opinions about the brand, its price, and other aspects.
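
As a rough, hypothetical illustration of the idea (not the method of any particular email provider), here is a minimal spam-filtering sketch using scikit-learn's CountVectorizer and a Naive Bayes classifier; the tiny training set is made up for the example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data: 1 = spam, 0 = not spam
texts = ["win a free prize now", "lowest price offer click here",
         "meeting rescheduled to monday", "please review the attached report"]
labels = [1, 1, 0, 0]

# Bag-of-words features + Naive Bayes classifier in one pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free offer click now", "see you at the meeting"]))
# Expected on this toy data: [1 0]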

INTRODUCTION TO CHATBOTS

A chatbot is a computer program designed to simulate conversation with human users, especially over the internet, powered by Artificial Intelligence.

Chatbots work in two simple steps. First, they identify the meaning of the
question asked and collect all the data from the user that may be required to
answer the question. Then they answer the question appropriately.

Types of Chatbots

1. Simple Chatbots (Script bots)

2. Smart Chatbots (AI based Smart bots)

HUMAN LANGUAGE VS COMPUTER LANGUAGE


• There are rules in human language. There are nouns, verbs, adverbs, and adjectives. A word can be a noun at one time and an adjective at another. There are rules that provide structure to a language.
• Besides the matter of arrangement, there is also meaning behind the language we use. Human communication is complex. There are multiple characteristics of the human language that might be easy for a human to understand but extremely difficult for a computer to understand.

Let’s understand Semantics and Syntax with some examples:

1. Different syntax, same semantics: 2+3 = 3+2

Here the way these statements are written is different, but their meanings are the
same that is 5.

2. Different semantics, same syntax: 3/2 (Python 2.7) ≠ 3/2 (Python 3)

Here the statements written have the same syntax but their meanings are different.

In Python 2.7, 3/2 would result in 1 (integer division), while in Python 3 it would give an output of 1.5 (true division).
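
A quick sketch you can run in a Python 3 interpreter to see the difference (Python 2.7's old behaviour of 3/2 is reproduced here with the floor-division operator):

# In Python 3, "/" is true division and "//" is floor division
print(3 / 2)    # 1.5 -> what Python 3 gives for 3/2
print(3 // 2)   # 1   -> the result Python 2.7 gave for 3/2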

Multiple Meanings of a word

To understand this, let us consider the following sentences:

1. His face turned red after he found out that he had taken the wrong bag.

What does this mean? Is he feeling ashamed because he took another person's bag instead of his own? Is he feeling angry because he did not manage to steal the bag that he had been targeting?

2. His face turns red after consuming the medicine.

Is he having an allergic reaction? Or is he not able to bear the taste of that medicine?

Perfect Syntax, no Meaning


Sometimes, a statement can have a perfectly correct syntax but it does not mean
anything.

For example

Chickens feed extravagantly while the moon drinks tea.

Text processing/Data processing, Bag of words, TFIDF, NLTK


Computers can't yet truly understand English the way humans do, but with AI and NLP they are learning fast and can already do a lot. With the help of NLP techniques, computers try to reach the meaning of sentences and take action or respond accordingly.

Since the language of computers is numerical, the very first step that comes to mind is to convert our language into numbers. This conversion takes a few steps, the first of which is Text Normalisation. Since human languages are complex, we need to simplify them first in order to make understanding possible. Text Normalisation helps in cleaning up the textual data in such a way that its complexity comes down to a level lower than that of the actual data.

Let us go through Text Normalisation in detail

Text Normalisation

Text Normalisation is a process to reduce variations in a text's word forms to a common form when the variations mean the same thing.

Text normalisation simplifies the text for further processing.

Raw Text from User → Text Normalisation Process → Output


In Text Normalisation, we go through several steps to normalise the text to a lower level. Before we begin, we need to understand that in this section we will be working on a collection of written text, that is, text from multiple documents. The term used for the whole textual data from all the documents taken together is corpus (a large or complete collection of writing / the entire set of language data to be analysed). Not only will we go through all the steps of Text Normalisation, we will also work them out on a corpus. Let us take a look at the steps:

1. Sentence Segmentation

Under sentence segmentation, the whole text is divided into individual sentences.
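
A minimal sketch of sentence segmentation with NLTK (assuming NLTK is installed and the 'punkt' tokenizer models have been downloaded once):

import nltk
# nltk.download('punkt')   # one-time download of the sentence tokenizer models
from nltk.tokenize import sent_tokenize

text = "Raj and Vijay are best friends. They play together with other friends."
print(sent_tokenize(text))
# ['Raj and Vijay are best friends.', 'They play together with other friends.']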

2. Tokenisation

After segmenting the sentences, each sentence is then further divided into
tokens. Token is a term used for any word or number or special character
occurring in a sentence. Under tokenisation, every word, number and special
character is considered separately and each of them is now a separate token.
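
Continuing the same sketch, each sentence can then be split into tokens with NLTK's word_tokenize (it also needs the 'punkt' models):

from nltk.tokenize import word_tokenize

sentence = "Raj likes to play football but Vijay prefers to play online games."
print(word_tokenize(sentence))
# ['Raj', 'likes', 'to', 'play', 'football', 'but', 'Vijay', 'prefers',
#  'to', 'play', 'online', 'games', '.']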

3. Removing Stop words, Special characters and Numbers

In this step, the tokens which are not necessary are removed from the token list.
What can be the possible words which we might not require?

Stop words are the words in any language which do not add much meaning to a
sentence. They can safely be ignored without sacrificing the meaning of the
sentence.

Humans use grammar to make their sentences meaningful for the other person to understand. But grammatical words do not add much to the information that is to be transmitted through the statement, hence they come under stop words. Some examples of stop words are: a, an, and, are, for, in, is, of, or, the, to.
These words occur the most in any given sentence but say very little or nothing about its context or meaning. Hence, to make it easier for the computer to focus on meaningful terms, these words are removed.

Along with these words, the sentence might have special characters and/or
numbers. Now it depends on the type of sentence in the documents that we are
working on whether we should keep them in it or not. For example, if you are
working on a document containing email IDs, then you might not want to remove
the special characters and numbers whereas in some other textual data if these
characters do not make sense, then you can remove them along with the stop
words.
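
A minimal sketch of this step using NLTK's built-in English stop word list (assuming the 'stopwords' corpus has been downloaded); dropping non-alphanumeric tokens as well is an assumption that suits this example:

import nltk
# nltk.download('stopwords')   # one-time download of the stop word lists
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ['aman', 'and', 'anil', 'are', 'stressed', '!']

# Keep tokens that are not stop words and are purely alphanumeric
filtered = [t for t in tokens if t not in stop_words and t.isalnum()]
print(filtered)   # ['aman', 'anil', 'stressed']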

4. Converting text to a common case

After stop word removal, we convert the whole text into the same case, preferably lower case. This ensures that the machine's case-sensitivity does not treat the same words as different just because they are written in different cases.
For example, different forms such as "Hello", "HELLO" and "hello" would all be converted to lower case and hence treated as the same word by the machine.
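
Case conversion itself is a one-liner in Python; a tiny sketch:

tokens = ["Hello", "HELLO", "hello", "HeLLo"]
print([t.lower() for t in tokens])   # ['hello', 'hello', 'hello', 'hello']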

5. Stemming

In this step, the remaining words are reduced to their root words. In other
words, stemming is the process in which the affixes of words are removed and
the words are converted to their base form.

Note that in stemming, the stemmed words (the words we get after removing the affixes) might not be meaningful. For example, healed, healing and healer are all reduced to heal, but studies is reduced to studi after affix removal, which is not a meaningful word. Stemming does not take into account whether the stemmed word is meaningful or not; it just removes the affixes, and hence it is faster.
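
A minimal stemming sketch using NLTK's Porter stemmer (one common stemmer; other stemmers may produce slightly different stems):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['healed', 'healing', 'studies']:
    print(word, '->', stemmer.stem(word))
# healed -> heal
# healing -> heal
# studies -> studi   (not a meaningful word, which is expected with stemming)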

6. Lemmatization

Stemming and lemmatization are alternative processes to each other, as both play the same role of removing affixes. The difference between them is that in lemmatization, the word we get after affix removal (known as the lemma) is a meaningful one. Lemmatization makes sure that the lemma is a word with meaning, and hence it takes longer to execute than stemming. In the same example, the output for studies after affix removal becomes study instead of studi.
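
A minimal lemmatization sketch with NLTK's WordNet lemmatizer (assuming the 'wordnet' corpus has been downloaded; by default it treats every word as a noun):

import nltk
# nltk.download('wordnet')   # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('studies'))        # study  (a meaningful lemma)
print(lemmatizer.lemmatize('healing', 'v'))   # heal   (lemmatized as a verb)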

The difference between stemming and lemmatization can be summarised by the same example: stemming reduces studies to studi, whereas lemmatization reduces studies to study.

With this we have normalised our text to tokens, which are the simplest form of words. Now it is time to convert the tokens into numbers. For this, we will use the Bag of Words algorithm.

Bag of words (BOW)


Bag of Words is a Natural Language Processing model which helps in extracting
features out of the text which can be helpful in machine learning algorithms. In
bag of words, we get the occurrences of each word and construct the vocabulary
for the corpus.
As a brief overview of how bag of words works: assume we have a normalised corpus obtained after going through all the steps of text processing. We put this text into the bag of words algorithm, and the algorithm returns the unique words in the corpus along with their occurrences. In other words, it gives us a list of the words appearing in the corpus, and the number corresponding to each word shows how many times that word has occurred in the text body. Thus, we can say that bag of words gives us two things:

1. A vocabulary of words for the corpus

2. The frequency of these words (number of times it has occurred in the whole
corpus).

Calling this algorithm a "bag" of words symbolises that the sequence of sentences or tokens does not matter in this case, as all we need are the unique words and their frequencies.

Here is the step-by-step approach to implement bag of words algorithm:

1. Text Normalisation: Collect data and pre-process it.

2. Create Dictionary: Make a list of all the unique words occurring in the corpus (the vocabulary).

3. Create document vectors: For each document in the corpus, find out how many times each word from the unique list of words has occurred.

4. Create document vectors for all the documents.

Let us go through all the steps with an example:

Step 1: Collecting data and pre-processing it.

Document 1: Aman and Anil are stressed

Document 2: Aman went to a therapist

Document 3: Anil went to download a health chatbot

Here are three documents having one sentence each. After text normalisation,
the text becomes:

Document 1: [aman, and, anil, are, stressed]

Document 2: [aman, went, to, a, therapist]

Document 3: [anil, went, to, download, a, health, chatbot]

Note that no tokens have been removed in the stopwords removal step. It is
because we have very little data and since the frequency of all the words is almost
the same, no word can be said to have lesser value than the other.

Step 2: Create Dictionary

Go through all the documents and list every word that occurs in the three documents, i.e., create the dictionary:

Dictionary: aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot

Note that even though some words are repeated in different documents, they are all written just once, as while creating the dictionary we create a list of unique words.

Step 3: Create document vector

In this step, the vocabulary is written in the top row. Now, for each word in the
document, if it matches with the vocabulary, put a 1 under it. If the same word
appears again, increment the previous value by 1. And if the word does not occur
in that document, put a 0 under it

Since in the first document, we have words: aman, and, anil, are, stressed. So, all
these words get a value of 1 and rest of the words get a 0 value.

Step 4: Repeat for all documents

The same exercise has to be done for all the documents. Hence, the table becomes:

Word:  aman and anil are stressed went to a therapist download health chatbot
Doc 1:   1   1    1   1     1       0   0  0    0         0       0      0
Doc 2:   1   0    0   0     0       1   1  1    1         0       0      0
Doc 3:   0   0    1   0     0       1   1  1    0         1       1      1

In this table, the header row contains the vocabulary of the corpus and the three rows correspond to the three documents. Take a look at this table and analyse the positioning of 0s and 1s in it.
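
The same document vector table can be reproduced with a short, self-contained Python sketch (a manual implementation of the steps above rather than a library call):

# Bag-of-words document vectors for the three normalised documents
docs = [
    ['aman', 'and', 'anil', 'are', 'stressed'],
    ['aman', 'went', 'to', 'a', 'therapist'],
    ['anil', 'went', 'to', 'download', 'a', 'health', 'chatbot'],
]

# Step 2: dictionary of unique words, in first-seen order
vocabulary = []
for doc in docs:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3 and 4: count each vocabulary word in each document
vectors = [[doc.count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for vec in vectors:
    print(vec)
# vocabulary: ['aman', 'and', 'anil', 'are', 'stressed', 'went', 'to', 'a',
#              'therapist', 'download', 'health', 'chatbot']
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
# [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]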

Finally, this gives us the document vector table for our corpus. But raw frequencies alone do not tell us how valuable each token is. This leads us to the final step of our algorithm: TFIDF.

TFIDF: Term Frequency & Inverse Document Frequency

The bag of words algorithm gives us the frequency of each word in every document of our corpus. It gives us the idea that if a word occurs more often in a document, its value is higher for that document. For example, in a document on air pollution, air and pollution would be words that occur many times, and these words are valuable too, as they give us some context about the document. But suppose we have 10 documents and all of them talk about different issues: one is on women empowerment, another is on unemployment, and so on. Do you think air and pollution would still be among the most frequently occurring words in the whole corpus? If not, then which words do you think would have the highest frequency in all of them?

And, this, is, the, etc. are the words which occur the most in almost all the
documents. But these words do not talk about the corpus at all. Though they are
important for humans as they make the statements understandable to us, for the
machine they are a complete waste as they do not provide us with any
information regarding the corpus. Hence, these are termed as stopwords and are
mostly removed at the pre-processing stage only.
Consider a plot of word occurrence versus word value. Words with the highest occurrence across all the documents of the corpus are said to have negligible value and are termed stop words; these are mostly removed at the pre-processing stage. As we move ahead from the stop words, the occurrence level drops drastically, and words that have adequate occurrence in the corpus are said to have some amount of value; these are termed frequent words, and they mostly talk about the document's subject. Then, as the occurrence of words drops further, the value of such words rises. These are termed rare or valuable words: they occur the least but add the most value to the corpus. Hence, when we analyse the text, we take the frequent and the rare words into consideration.

TFIDF stands for Term Frequency and Inverse Document Frequency.

TFIDF helps us in identifying the value of each word.

Term Frequency

Term frequency is the frequency of a word in one document. Term frequency can
easily be found from the document vector table as in that table we mention the
frequency of each word of the vocabulary in each document.
Here, you can see that the frequency of each word for each document has been
recorded in the table. These numbers are nothing but the Term Frequencies!

Inverse Document Frequency

Now, let us look at the other half of TFIDF, which is Inverse Document Frequency. For this, let us first understand what document frequency means. Document frequency is the number of documents in which a word occurs, irrespective of how many times it has occurred in those documents. The document frequencies for the example vocabulary would be:

aman: 2, anil: 2, went: 2, to: 2, a: 2, and: 1, are: 1, stressed: 1, therapist: 1, download: 1, health: 1, chatbot: 1

Here, you can see that the document frequency of 'aman', 'anil', 'went', 'to' and 'a' is 2, as they have occurred in two documents. The rest occur in just one document, hence their document frequency is 1.

For inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator. Here, the total number of documents is 3, hence the inverse document frequency becomes:

aman: 3/2, anil: 3/2, went: 3/2, to: 3/2, a: 3/2, and: 3/1, are: 3/1, stressed: 3/1, therapist: 3/1, download: 3/1, health: 3/1, chatbot: 3/1
Finally, the formula of TFIDF for any word W becomes:

TFIDF(W) = TF(W)*log(IDF(W))

Here, log is to the base of 10. Don't worry! You don't need to calculate the log
values by yourself. Simply use the log function in the calculator and find out!

Now, let's multiply the IDF values with the TF values. Note that the TF values are for each document while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values with each row of the document vector table.

The IDF value for a word like 'aman' is the same in every row, and a similar pattern is followed for all the words of the vocabulary. After calculating all the values, the words have finally been converted to numbers. These numbers are the values of each word for each document. Here, you can see that since we have a small amount of data, words like 'are' and 'and' also have a high value. But when there is more data, such common words occur in many more documents, their document frequency rises, and their value drops. For example:

Total Number of documents: 10

Number of documents in which 'and' occurs: 10

Therefore, IDF(and) = 10/10 = 1

Which means: log (1) = 0. Hence, the value of 'and' becomes 0.

On the other hand, the number of documents in which 'pollution' occurs: 3

IDF(pollution) = 10/3 = 3.3333...

Which means: log(3.3333) = 0.522, which shows that the word 'pollution' has considerable value in the corpus.

Summarizing the concept, we can say that:

1. Words that occur in all the documents with high term frequencies have the
least values and are considered to be the stop words.

2. For a word to have high TFIDF value, the word needs to have a high term
frequency but less document frequency which shows that the word is important
for one document but is not a common word for all documents.

3. These values help the computer understand which words are to be considered
while processing the natural language. The higher the value, the more important
the word is for a given corpus.
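
Putting the formula to work, here is a short sketch that computes the TFIDF values for the Aman/Anil corpus using the definition given above, TFIDF(W) = TF(W) * log10(N / DF(W)):

import math

docs = [
    ['aman', 'and', 'anil', 'are', 'stressed'],
    ['aman', 'went', 'to', 'a', 'therapist'],
    ['anil', 'went', 'to', 'download', 'a', 'health', 'chatbot'],
]

N = len(docs)
vocabulary = sorted({word for doc in docs for word in doc})

# Document frequency: the number of documents each word appears in
df = {w: sum(1 for doc in docs if w in doc) for w in vocabulary}

# TFIDF(W) = TF(W) * log10(N / DF(W)) for each word in each document
for i, doc in enumerate(docs, start=1):
    tfidf = {w: doc.count(w) * math.log10(N / df[w]) for w in vocabulary}
    print(f"Document {i}:", {w: round(v, 3) for w, v in tfidf.items() if v > 0})
# A word that occurred in every document would get log10(3/3) = 0, i.e. no value.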

Applications of TF-IDF

1. Document Classification: TF-IDF helps in classifying the type and genre of a document by looking at the frequencies of words in the text. Based on the TF-IDF values, it is easy to classify emails as spam or ham, to classify news as real or fake, and so on.

2. Topic Modelling: It helps in predicting the topic of the corpus. Topic modelling refers to a method of identifying short and informative descriptions of a document in a large collection, which can further be used for various text mining tasks such as summarisation, document classification, etc.

3. Keyword Extraction: It is also useful for extracting keywords from text.

4. Information Retrieval System: To extract the important information out of a corpus.

5. Stop word Filtering: It helps in removing unnecessary words from a text body.

NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. It is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.

NLTK has been called "a wonderful tool for teaching, and working in,
computational linguistics using Python," and "an amazing library to play with
natural language."

Natural Language Processing with Python provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more.
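
A minimal getting-started sketch with NLTK (assuming it has been installed with pip and the required data packages are downloaded on first use):

import nltk
# nltk.download('punkt')   # one-time download of the tokenizer models
from nltk.tokenize import word_tokenize
from nltk import FreqDist

text = "Raj and Vijay are best friends. They play together with other friends."
tokens = word_tokenize(text.lower())

# Frequency distribution of the tokens
fdist = FreqDist(tokens)
print(fdist.most_common(3))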
ACTIVITY 1

Normalize the given text

Raj and Vijay are best friends. They play together with other friends. Raj likes to
play football but Vijay prefers to play online games. Raj wants to be a footballer.
Vijay wants to become an online gamer.

ACTIVITY 2

Through a step-by-step process, calculate TFIDF for the given corpus

Document 1: Johny Johny, Yes Papa,

Document 2: Eating sugar? No Papa

Document 3: Telling lies? No Papa

Document 4: Open your mouth, Ha! Ha! Ha!
