Natural Language Processing
Computers can understand structured forms of data, such as spreadsheets and tables in a database, but human languages, texts, and voices form an unstructured category of data that is difficult for a computer to understand. This is where Natural Language Processing comes in.
Natural Language Processing, or NLP, is the sub-field of AI focused on enabling computers to understand and process human languages. It draws on Linguistics, Computer Science, Information Engineering, and Artificial Intelligence, and is concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Solving a complex problem in Machine Learning means building a pipeline. In simple terms, it means breaking a complex problem into a number of small problems, making models for each of them, and then integrating these models. A similar thing is done in NLP: we can break down the process of understanding English for a model into a number of small pieces.
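As a rough illustration only (the function names below are made up for this sketch and are not taken from any library), such a chain of small pieces might look like this in Python:

```python
# A minimal, illustrative sketch of breaking NLP down into small steps.
# Each function is a placeholder for a model or rule-based component.

def segment_sentences(text):
    # split raw text into sentences (naively, on full stops)
    return [s.strip() for s in text.split(".") if s.strip()]

def tokenize(sentence):
    # split a sentence into lower-cased words (tokens)
    return sentence.lower().split()

def remove_stopwords(tokens, stopwords={"a", "an", "the", "is", "are", "and"}):
    # drop words that carry little meaning
    return [t for t in tokens if t not in stopwords]

def nlp_pipeline(text):
    # chain the small pieces together
    return [remove_stopwords(tokenize(s)) for s in segment_sentences(text)]

print(nlp_pipeline("Raj likes to play football. Vijay prefers online games."))
```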
Some of the most common applications of Natural Language Processing in today's world are described below.

2. Autocomplete in Search Engines
Have you noticed that search engines tend to guess what you are typing and
automatically complete your sentences? For example, on typing “game” in
Google, you may get further suggestions for “game of thrones”, “game of life” or
if you are interested in maths then “game theory”. All these suggestions are
provided using autocomplete that uses Natural Language Processing to guess
what you want to ask. Search engines use their enormous data sets to analyze
what their customers are probably typing when they enter particular words and
suggest the most common possibilities. They use Natural Language Processing to
make sense of these words and how they are interconnected to form different
sentences.
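As a very simplified sketch of this idea (the query log below is invented for illustration; real search engines use far more sophisticated models), suggestions can be ranked by how frequently each matching query was typed before:

```python
from collections import Counter

# Hypothetical log of past queries (illustrative data only)
past_queries = [
    "game of thrones", "game of thrones", "game of life",
    "game theory", "game of thrones", "game theory",
]

def autocomplete(prefix, queries, k=3):
    # count how often each matching query was typed and return the most common
    matches = Counter(q for q in queries if q.startswith(prefix))
    return [query for query, _ in matches.most_common(k)]

print(autocomplete("game", past_queries))
# ['game of thrones', 'game theory', 'game of life']
```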
3. Voice Assistants
These days, voice assistants are all the rage! Whether it's Siri, Alexa, or Google
Assistant, almost everyone uses one of these to make calls, place reminders,
schedule meetings, set alarms, surf the internet, etc.
These voice assistants have made life much easier. But how do they work? They
use a complex combination of speech recognition, natural language
understanding, and natural language processing to understand what humans are
saying and then act on it. The long term goal of voice assistants is to become a
bridge between humans and the internet and provide all manner of services
based on just voice interaction. However, they are still a little far from that goal
seeing as Siri still can’t understand what you are saying sometimes!
4. Language Translator
Want to translate a text from English to Hindi but don’t know Hindi? Well, Google
Translate is the tool for you! While it’s not exactly 100% accurate, it is still a great
tool to convert text from one language to another. Google Translate and other
translation tools use sequence-to-sequence modeling, a technique in Natural Language Processing. It allows the algorithm to convert a sequence of words from one language into a sequence of words in another language, which is what translation is. Earlier,
language translators used Statistical machine translation (SMT) which meant they
analyzed millions of documents that were already translated from one language
to another (English to Hindi in this case) and then looked for the common
patterns and basic vocabulary of the language. However, this method was not
that accurate as compared to Sequence to sequence modeling.
5. Sentiment Analysis
Almost all the world is on social media these days! And companies can use
sentiment analysis to understand how a particular type of user feels about a
particular topic, product, etc. They can use natural language processing,
computational linguistics, text analysis, etc. to understand the general sentiment
of the users for their products and services and find out if the sentiment is good,
bad, or neutral. Companies can use sentiment analysis in a lot of ways such as to
find out the emotions of their target audience, to understand product reviews, to
gauge their brand sentiment, etc. And it is not just private companies; even governments use sentiment analysis to find popular opinion and to detect any threats to the security of the nation.
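As one possible sketch of sentiment analysis in Python, NLTK's VADER analyzer (assuming the nltk package and its vader_lexicon resource are available) can label a piece of text as good, bad, or neutral; the example reviews are invented:

```python
import nltk
nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
reviews = [
    "This product is absolutely wonderful, I love it!",
    "Terrible experience, the delivery was late and the item was broken.",
]
for review in reviews:
    scores = analyzer.polarity_scores(review)  # neg/neu/pos/compound scores
    compound = scores["compound"]
    # commonly used thresholds for turning the compound score into a label
    if compound > 0.05:
        label = "good"
    elif compound < -0.05:
        label = "bad"
    else:
        label = "neutral"
    print(label, scores)
```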
6. Grammar Checkers
Grammar and spelling are very important when writing professional reports
for your superiors and even assignments for your lecturers. After all, having major
errors may get you fired or failed! That’s why grammar and spell checkers are a
very important tool for any professional writer. They can not only correct
grammar and check spellings but also suggest better synonyms and improve the
overall readability of your content. And guess what, they utilize natural language
processing to provide the best possible piece of writing! The NLP algorithm is
trained on millions of sentences to understand the correct format. That is why it
can suggest the correct verb tense, a better synonym, or a clearer sentence
structure than what you have written.
Some of the most popular grammar checkers that use NLP include Grammarly,
WhiteSmoke, ProWritingAid, etc.
7. Email Filters
Emails are still the most important method for professional communication.
However, all of us still get thousands of promotional Emails that we don’t want to
read. Thankfully, our emails are automatically divided into 3 sections namely,
Primary, Social and Promotions which means we never have to open the
Promotional section! But how does this work? Email services use natural language
processing to identify the contents of each Email with text classification so that it
can be put in the correct section. This method is not perfect since there are still
some Promotional newsletters in Primary, but it’s better than nothing. In more
advanced cases, some companies also use specialty anti-virus software with
natural language processing to scan the Emails and see if there are any patterns
and phrases that may indicate a phishing attempt on the employees.
8. Text Summarization
Text summarization is the process of creating a shorter version of the text with
only vital information and thus, helps the user to understand the text in a shorter
amount of time.
The main advantage of text summarization is that it reduces the user's time spent searching for the important details in a document.
There are two main approaches to summarizing text documents: extractive summarization, which picks out the most important sentences from the original text and combines them, and abstractive summarization, which generates new sentences that convey the key information of the original text.
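As a small sketch of the extractive approach (a simple word-frequency heuristic, used here purely for illustration), sentences can be scored by how frequent their words are in the whole text and the top-scoring sentences kept in their original order:

```python
from collections import Counter
import re

def extractive_summary(text, num_sentences=2):
    # split the text into sentences (naively, on full stops)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # word frequencies over the whole text
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    # score each sentence by the total frequency of its words
    scored = [(sum(freq[w] for w in re.findall(r"[a-z]+", s.lower())), s)
              for s in sentences]
    # keep the highest-scoring sentences, in their original order
    top = sorted(scored, reverse=True)[:num_sentences]
    ordered = sorted((s for _, s in top), key=sentences.index)
    return ". ".join(ordered) + "."

text = ("Raj and Vijay are best friends. They play together with other friends. "
        "Raj likes to play football but Vijay prefers to play online games. "
        "Raj wants to be a footballer. Vijay wants to become an online gamer.")
print(extractive_summary(text))
```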
1. Text Classification
Text is a form of unstructured data that carries very rich information within it. Text classifiers categorize and organize practically any kind of textual content that we use today. Text classification makes it possible to assign predefined categories to a document and organize it to help you find the information you need or simplify some activities. For example, one application of text categorization is spam filtering in email. A key question for any business is how its products are being received by their intended buyers, and text classification answers such questions by classifying people's opinions about the brand, its price, and its features.
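A minimal sketch of the spam-filtering example using text classification; it assumes scikit-learn is installed, and the tiny training set is invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: 1 = spam, 0 = not spam
emails = [
    "Win a free lottery prize now",
    "Limited offer, claim your free reward",
    "Meeting agenda for Monday attached",
    "Please review the project report",
]
labels = [1, 1, 0, 0]

# Bag-of-words features followed by a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["Claim your free prize today",
                     "Can we schedule a meeting to review the report?"]))
# Expected output: [1 0]
```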
INTRODUCTION TO CHATBOTS
Chatbots work in two simple steps. First, they identify the meaning of the
question asked and collect all the data from the user that may be required to
answer the question. Then they answer the question appropriately.
Types of Chatbots
Broadly, chatbots are of two types: simple script (rule-based) bots that follow a fixed, predetermined flow of conversation, and smart (AI-based) bots that use NLP to understand free-form queries and learn from interactions.
For a chatbot, or any NLP system, to work, the machine must make sense of human language, which is very different from the language a computer understands. Consider, for example, the statements 2 + 3 and 3 + 2: the way these statements are written is different, but their meaning is the same, that is, 5. On the other hand, a statement like 3/2 is written the same way everywhere, yet in Python 2.7 it would result in 1 while in Python 3 it would give an output of 1.5. Human language adds a further difficulty: a single sentence can have more than one meaning. For example:
1. His face turned red after he found out that he had taken the wrong bag. (Did his face turn red out of anger, or out of embarrassment?)
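The division example can be verified directly in Python 3:

```python
# In Python 3, / is true division and // is floor (integer) division
print(3 / 2)   # 1.5
print(3 // 2)  # 1 -- the result Python 2.7 gave for 3 / 2
```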
Since we all know that the language of computers is Numerical, the very first step
that comes to our mind is to convert our language to numbers. This conversion
takes a few steps. The first step is Text Normalisation. Since human languages are complex, we first need to simplify them to make understanding possible. Text Normalisation helps in
cleaning up the textual data in such a way that it comes down to a level where its
complexity is lower than the actual data.
Text Normalisation
1. Sentence Segmentation
Under sentence segmentation, the whole text is divided into individual sentences.
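For example, assuming NLTK and its punkt tokenizer models are available, sentence segmentation can be done like this:

```python
import nltk
nltk.download("punkt")  # one-time download of the sentence tokenizer model

from nltk.tokenize import sent_tokenize

text = "Raj and Vijay are best friends. They play together with other friends."
print(sent_tokenize(text))
# ['Raj and Vijay are best friends.', 'They play together with other friends.']
```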
2. Tokenisation
After segmenting the sentences, each sentence is then further divided into
tokens. Token is a term used for any word or number or special character
occurring in a sentence. Under tokenisation, every word, number and special
character is considered separately and each of them is now a separate token.
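Continuing with the same assumption (NLTK with the punkt models), each sentence can be split into tokens; note that punctuation marks become separate tokens:

```python
from nltk.tokenize import word_tokenize

sentence = "Raj likes to play football, but Vijay prefers online games!"
print(word_tokenize(sentence))
# ['Raj', 'likes', 'to', 'play', 'football', ',', 'but', 'Vijay', 'prefers',
#  'online', 'games', '!']
```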
3. Removing Stop Words, Special Characters and Numbers
In this step, the tokens which are not necessary are removed from the token list. Which words might we not require?
Stop words are the words in any language which do not add much meaning to a
sentence. They can safely be ignored without sacrificing the meaning of the
sentence.
Humans use grammar to make their sentences meaningful for the other person to
understand. But grammatical words do not add any essence to the information
which is to be transmitted through the statement hence they come under stop
words. Some examples of stop words are: a, an, and, the, is, of, to, etc.
These words occur the most in any given sentence but talk very little or nothing
about the context or the meaning of it. Hence, to make it easier for the computer
to focus on meaningful terms, these words are removed.
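A small sketch of stop word removal using NLTK's built-in English stop word list (assuming the stopwords resource has been downloaded):

```python
import nltk
nltk.download("stopwords")  # one-time download of the stop word lists

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("Raj and Vijay are the best of friends")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['Raj', 'Vijay', 'best', 'friends']
```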
Along with these words, the sentence might have special characters and/or
numbers. Now it depends on the type of sentence in the documents that we are
working on whether we should keep them in it or not. For example, if you are
working on a document containing email IDs, then you might not want to remove
the special characters and numbers whereas in some other textual data if these
characters do not make sense, then you can remove them along with the stop
words.
4. Converting Text to a Common Case
After the stop words removal, we convert the whole text into a similar case, preferably lower case. This ensures that the machine does not treat the same words as different just because they appear in different cases. For example, all the different forms of "hello" (hello, HELLO, Hello, and so on) would be converted to lower case and hence would be treated as the same word by the machine.
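A one-line illustration of case conversion in Python:

```python
tokens = ["Hello", "HELLO", "hello", "HeLLo"]
print({t.lower() for t in tokens})  # every form collapses to {'hello'}
```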
5. Stemming
In this step, the remaining words are reduced to their root words. In other
words, stemming is the process in which the affixes of words are removed and
the words are converted to their base form.
Note that in stemming, the stemmed words (the words we get after removing the affixes) might not be meaningful. In this example, as you can see, healed, healing and healer were all reduced to heal, but studies was reduced to studi after the affix removal, which is not a meaningful word. Stemming does not take into account whether the stemmed word is meaningful or not. It just removes the affixes, and hence it is faster.
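For instance, NLTK's Porter stemmer (one common stemming algorithm) reduces these words as follows:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["healed", "healing", "studies"]:
    print(word, "->", stemmer.stem(word))
# healed -> heal, healing -> heal, studies -> studi
```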
6. Lemmatization
Stemming and lemmatization are alternatives to each other, as the role of both processes is the same: removal of affixes. The difference between them is that in lemmatization, the word we get after affix removal (also known as the lemma) is a meaningful one. Lemmatization makes sure that the lemma is a word with meaning, and hence it takes longer to execute than stemming. As you can see in the same example, the output for studies after affix removal has become study instead of studi.
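The same kind of words can be lemmatized with NLTK's WordNet lemmatizer (assuming the wordnet resource has been downloaded); unlike the stemmer, it returns meaningful words:

```python
import nltk
nltk.download("wordnet")  # one-time download of the WordNet dictionary

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))           # study (treated as a noun by default)
print(lemmatizer.lemmatize("healing", pos="v"))  # heal  (treated as a verb)
```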
With this we have normalised our text to tokens which are the simplest form of
words. Now it is time to convert the tokens into numbers. For this, we would use
the Bag of Words algorithm.
The Bag of Words algorithm gives us two things:
1. A vocabulary of the unique words in the corpus.
2. The frequency of these words (the number of times each has occurred in the whole corpus).
Calling this algorithm a "bag" of words symbolises that the sequence of the sentences or tokens does not matter in this case; all we need are the unique words and their frequencies.
The steps of the Bag of Words algorithm are: normalise the text, create the dictionary of unique words, and then create document vectors for all the documents. Let us go through all the steps with an example:
Here are three documents having one sentence each. After text normalisation,
the text becomes:
Note that no tokens have been removed in the stopwords removal step. It is
because we have very little data and since the frequency of all the words is almost
the same, no word can be said to have lesser value than the other.
Step 2:
Create Dictionary
Go through all the documents and create a dictionary, i.e., list down all the words which occur in the three documents:
Dictionary:
Note that even though some words are repeated in different documents, they are
all written just once as while creating the dictionary, we create the list of unique
words.
Step 3:
Create Document Vectors
In this step, the vocabulary is written in the top row. Now, for each word in the document, if it matches the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. And if the word does not occur in that document, put a 0 under it.
In the first document, we have the words: aman, and, anil, are, stressed. So, all these words get a value of 1 and the rest of the words get a value of 0.
The same exercise has to be done for all the documents. Hence, the table becomes:
In this table, the header row contains the vocabulary of the corpus and the three rows correspond to the three different documents. Take a look at this table and analyse the positioning of 0s and 1s in it.
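A small sketch of these steps in Python; the first document is the one from the example above, while the other two are stand-ins invented here because the full example documents are not reproduced in the text:

```python
from collections import Counter

# Document 1 is from the example above; documents 2 and 3 are illustrative stand-ins.
documents = [
    "aman and anil are stressed",
    "aman went to a doctor",
    "anil went to a therapist",
]

tokenised = [doc.split() for doc in documents]

# Step 2: create the dictionary of unique words
vocabulary = sorted(set(word for tokens in tokenised for word in tokens))
print(vocabulary)

# Step 3: create a document vector (word frequencies) for each document
for tokens in tokenised:
    counts = Counter(tokens)
    print([counts[word] for word in vocabulary])
```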
Finally, this gives us the document vector table for our corpus. But the tokens have still not been converted into values that reflect their importance. This leads us to the final step of our algorithm: TFIDF.
TFIDF: Term Frequency and Inverse Document Frequency
Words like and, this, is, the, etc. occur the most in almost all the documents, but they do not tell us anything about the corpus. Though they are important for humans, as they make the statements understandable to us, for the machine they are of no use as they do not provide any information about the corpus. Hence, these are termed stopwords and are mostly removed at the pre-processing stage itself.
Take a look at this graph of the occurrence of words versus their value. If words have the highest occurrence in all the documents of the corpus, they are said to have negligible value, and hence they are termed stop words. These words are mostly removed at the pre-processing stage itself. As we move ahead from the stop words, the occurrence level drops drastically, and the words which have adequate occurrence in the corpus are said to have some amount of value and are termed frequent words. These words mostly talk about the document's subject, and their occurrence in the corpus is adequate. Then, as the occurrence of words drops further, the value of such words rises. These words are termed rare or valuable words: they occur the least but add the most value to the corpus. Hence, when we look at the text, we take the frequent and rare words into consideration.
Term Frequency
Term frequency is the frequency of a word in one document. Term frequency can
easily be found from the document vector table as in that table we mention the
frequency of each word of the vocabulary in each document.
Here, you can see that the frequency of each word for each document has been
recorded in the table. These numbers are nothing but the Term Frequencies!
Now, let us look at the other half of TFIDF which is Inverse Document Frequency.
For this, let us first understand what document frequency means. Document
Frequency is the number of documents in which the word occurs irrespective of
how many times it has occurred in those documents. The document frequency for
the exemplar vocabulary would be:
Here, you can see that the document frequency of 'aman', 'anil', 'went', 'to' and 'a' is 2, as they have occurred in two documents. The rest of them occurred in just one document, hence the document frequency for them is one.
Inverse Document Frequency (IDF) is obtained by dividing the total number of documents in the corpus by the document frequency of the word. The formula for TFIDF of any word W then becomes:
TFIDF(W) = TF(W) * log(IDF(W)), where IDF(W) = (total number of documents) / (document frequency of W)
Here, log is to the base 10. Don't worry! You don't need to calculate the log values by yourself. Simply use the log function on a calculator and find out!
Now, let's multiply the IDF values with the TF values. Note that the TF values are for each document while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values into each row of the document vector table. Here, you can see that the IDF value for 'aman' is the same in each row, and a similar pattern is followed for all the words of the vocabulary. After calculating all
the values, we get:
Finally, the words have been converted to numbers. These numbers are the value of each word for each document. Here, you can see that since we have a very small amount of data, words like 'are' and 'and' also have a high value. But as a word occurs in more and more documents, its IDF value decreases and so does its value. For example, if a corpus has 10 documents and the word 'pollution' occurs in only 3 of them, then IDF(pollution) = 10/3 = 3.3333, which means log(3.3333) = 0.522; this shows that the word 'pollution' has considerable value in the corpus.
To summarise:
1. Words that occur in all the documents with high term frequencies have the
least values and are considered to be the stop words.
2. For a word to have high TFIDF value, the word needs to have a high term
frequency but less document frequency which shows that the word is important
for one document but is not a common word for all documents.
3. These values help the computer understand which words are to be considered
while processing the natural language. The higher the value, the more important
the word is for a given corpus.
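A small sketch that computes these values with the same formula used above, TF(W) multiplied by log to the base 10 of (total documents / document frequency); the documents are the same illustrative ones used in the Bag of Words sketch:

```python
import math
from collections import Counter

documents = [
    "aman and anil are stressed",
    "aman went to a doctor",
    "anil went to a therapist",
]

tokenised = [doc.split() for doc in documents]
vocabulary = sorted(set(w for tokens in tokenised for w in tokens))
N = len(documents)

# Document frequency: in how many documents does each word occur?
df = {w: sum(1 for tokens in tokenised if w in tokens) for w in vocabulary}

# TFIDF(W) = TF(W) * log10(N / DF(W)) for every word in every document
for tokens in tokenised:
    tf = Counter(tokens)
    print({w: round(tf[w] * math.log10(N / df[w]), 3) for w in vocabulary})
```

Words that occur in all three documents get log10(3/3) = 0 and hence a value of zero, exactly as the stop word discussion above predicts.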
Applications of TF-IDF
1. Topic Modelling: It helps in predicting the topic of a corpus. Topic modelling refers to a method of identifying short and informative descriptions of a document in a large collection, which can further be used for various text mining tasks such as summarisation, document classification, etc.
2. Keyword Extraction: It is also useful for extracting keywords from text.
3. Stop Word Filtering: It helps in removing unnecessary words from a text body.
NLTK is a leading platform for building Python programs to work with human
language data. It provides easy-to-use interfaces to over 50 corpora and lexical
resources such as WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic reasoning,
wrappers for industrial-strength NLP libraries, and an active discussion
forum.
NLTK has been called "a wonderful tool for teaching, and working in,
computational linguistics using Python," and "an amazing library to play with
natural language."
Raj and Vijay are best friends. They play together with other friends. Raj likes to
play football but Vijay prefers to play online games. Raj wants to be a footballer.
Vijay wants to become an online gamer.
ACTIVITY 2