Natural Language Processing
NLP breaks human language down into meaningful pieces through a series of
processing steps. A good starting point is text classification, a common
supervised-learning application:
If, for example, you are trying to understand how customers feel about your
product or service based on their social media comments, a relatively simple text
classification model can be built by comparing text, emoticons, and the ratings
they correspond to, even if reviewers never directly use obvious watchwords such
as “like,” “dislike,” or “hate.”
To do this, we would program our NLP model to match words, phrases, and other
bits of text with their corresponding star ratings across a large dataset of reviews
(including millions of ratings). From this, the model can infer things like a “:/”
typically meaning someone is less than satisfied, since the reviews it appears in
have an average rating of 2.2 stars.
Other, less obvious signals can also emerge from the data, such as slang, symbol
swearing (for example, “#$@@”), and the usage frequency of exclamation points
and question marks.
Statistics in hand, the NLP model can now automatically assign a “sentiment” to
purely textual input. Actions can then be developed in response, such as alerting a
customer service agent to reply directly to a negative comment, or simply
measuring consumer feedback about a new policy, product, or service on social
media.
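The idea can be sketched in a few lines of plain Python. The token scores below are invented for illustration (only the “:/” → 2.2 pairing comes from the article); a real model would learn them from millions of rated reviews.

```python
# Hypothetical sentiment scorer: maps tokens and emoticons to the average
# star rating of reviews they appeared in. Scores here are made up,
# except ":/" -> 2.2, which echoes the article's example.
AVG_STARS = {
    ":/": 2.2,
    ":)": 4.5,
    "broken": 1.8,
    "love": 4.7,
}

def sentiment_score(text: str, default: float = 3.0) -> float:
    """Average the learned star ratings of any known tokens in the text."""
    tokens = text.lower().split()
    scores = [AVG_STARS[t] for t in tokens if t in AVG_STARS]
    return sum(scores) / len(scores) if scores else default

print(sentiment_score("shipping was slow :/"))  # 2.2
```

A threshold on the returned score (say, below 2.5) could then trigger an alert to a customer service agent.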
If you use Gmail, you’ve been seeing this in action for quite some time. It filters
out spam and auto-sorts email into Primary, Social, Promotions, and Updates
based on language patterns it has identified for each category.
Speech Queries
If, however, you want to build a system capable of recognizing and responding to
speech, you’ve got a few more steps ahead of you.
Remember breaking sentences down into subject, object, verb, indirect object,
etc. in elementary school? At least a little? Then you’ve done this type of work
before.
We’ll walk through a quick example with the following sentence: “Andrew is on a
flight to Bali, an island in Indonesia.”
Tokenization:
Each word is first separated and broken down into tokens. “Andrew,” “is,” “on,” “a,”
“flight,” “to,” “Bali,” “,” “an,” “island,” “in,” “Indonesia,” “.”
Punctuation also becomes part of our token set because it affects meaning.
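A minimal tokenizer can be sketched with a regular expression that splits off words and keeps each punctuation mark as its own token (real tokenizers handle many more edge cases, such as contractions and URLs):

```python
import re

def tokenize(sentence: str) -> list[str]:
    # \w+ grabs runs of word characters; [^\w\s] grabs each
    # punctuation mark as a separate token.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("Andrew is on a flight to Bali, an island in Indonesia."))
# ['Andrew', 'is', 'on', 'a', 'flight', 'to', 'Bali', ',',
#  'an', 'island', 'in', 'Indonesia', '.']
```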
Parts of Speech Prediction:
Then we look at each word or token and try to determine whether it is a pronoun,
adjective, verb, etc. This is done by running the sentence through a pre-trained
“parts-of-speech” classification model that has already statistically examined the
relationships within millions of English sentences.
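As a toy stand-in for such a trained model, the sketch below tags tokens from a hand-written lookup table. The lexicon is invented for this one example sentence; a real tagger infers tags statistically, including for words it has never seen.

```python
# Toy "parts-of-speech" lookup standing in for a trained statistical
# tagger. Tags follow the common Universal POS convention
# (PROPN = proper noun, ADP = adposition, DET = determiner).
POS_LEXICON = {
    "andrew": "PROPN", "is": "VERB", "on": "ADP", "a": "DET",
    "flight": "NOUN", "to": "ADP", "bali": "PROPN", "an": "DET",
    "island": "NOUN", "in": "ADP", "indonesia": "PROPN",
}

def tag(tokens):
    # Punctuation gets "PUNCT"; unknown words fall back to "X".
    return [
        (t, "PUNCT" if not t.isalpha() else POS_LEXICON.get(t.lower(), "X"))
        for t in tokens
    ]

print(tag(["Andrew", "is", "on", "a", "flight", "."]))
# [('Andrew', 'PROPN'), ('is', 'VERB'), ('on', 'ADP'),
#  ('a', 'DET'), ('flight', 'NOUN'), ('.', 'PUNCT')]
```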
Lemmatization:
This step finds the base form (lemma) of each word, understanding that “person”
is the singular form of “people,” and that “is,” “was,” “were,” and “am” are all
forms of “to be.”
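At its simplest, lemmatization can be sketched as a lookup table of irregular forms (the table below is a tiny illustrative fragment; real lemmatizers combine such tables with morphological rules and the word’s part of speech):

```python
# Miniature lemma table covering the article's examples.
LEMMAS = {
    "is": "be", "was": "be", "were": "be", "am": "be",
    "people": "person",
}

def lemmatize(token: str) -> str:
    # Fall back to the lowercased token when no lemma is known.
    return LEMMAS.get(token.lower(), token.lower())

print([lemmatize(t) for t in ["Andrew", "is", "on", "a", "flight"]])
# ['andrew', 'be', 'on', 'a', 'flight']
```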
Stop Word Removal:
Articles like “the,” “an,” and “a” are often removed because their high frequency
can cause relational confusion. However, each NLP application’s stop-word list has
to be carefully crafted, as there is no standard set that works for all models.
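Stop-word removal itself is a simple filter. The list below is illustrative only; as noted above, each application should craft its own:

```python
# An illustrative stop-word set -- there is no standard list that
# suits every application.
STOP_WORDS = {"the", "a", "an", "is", "on", "to", "in"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["Andrew", "is", "on", "a", "flight", "to", "Bali"]))
# ['Andrew', 'flight', 'Bali']
```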
Dependency Parsing:
In this phase, a syntax structure is derived to allow the AI to understand
sentence attributes such as subject, object, etc.
This allows the AI to understand that, while John and ball are both nouns, in the
phrase “John hit the ball out of the park,” John has done the action of hitting.
Open source parsers like spaCy can be used to define the properties of each word
and build syntax trees.
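One common way to represent such a parse is as a list of (token, head, relation) triples, where each token points at its syntactic head. The parse below for “John hit the ball” is hand-written for illustration, not produced by an actual parser:

```python
# A dependency parse as (token, head, relation) triples.
# "nsubj" = nominal subject, "dobj" = direct object, "det" = determiner.
parse = [
    ("John", "hit",  "nsubj"),   # John performs the action of hitting
    ("hit",  "ROOT", "root"),
    ("the",  "ball", "det"),
    ("ball", "hit",  "dobj"),    # ball receives the action
]

def subject_of(parse, verb):
    """Return the tokens attached to `verb` with the subject relation."""
    return [tok for tok, head, rel in parse if head == verb and rel == "nsubj"]

print(subject_of(parse, "hit"))  # ['John']
```

This is how the AI distinguishes John, the noun doing the hitting, from ball, the noun being hit.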
Named Entity Recognition:
The goal of this phase is to extract nouns from the text, including people, brand
names, acronyms, places, dates, etc.
A good NLP model can differentiate between noun types, such as June the person
and June the month, based on statistical inference from surrounding words, like
the presence of the preposition “in.”
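A crude rule-based sketch of the idea: a small gazetteer of known names plus one contextual cue, “in” before “June,” suggesting the month rather than the person. The word lists are invented for illustration; real NER models learn such cues statistically.

```python
# Illustrative gazetteers (invented for this sketch).
MONTHS = {"june", "july"}
PEOPLE = {"june", "andrew"}
PLACES = {"bali", "indonesia"}

def ner(tokens):
    entities = []
    for i, t in enumerate(tokens):
        low = t.lower()
        # Contextual cue: "in June" -> the month, not the person.
        if low in MONTHS and i > 0 and tokens[i - 1].lower() == "in":
            entities.append((t, "DATE"))
        elif low in PEOPLE:
            entities.append((t, "PERSON"))
        elif low in PLACES:
            entities.append((t, "GPE"))
    return entities

print(ner(["June", "flew", "to", "Bali", "in", "June"]))
# [('June', 'PERSON'), ('Bali', 'GPE'), ('June', 'DATE')]
```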
Coreference Resolution:
Coreferencing tracks pronouns across sentences in relation to their entity. It is,
many argue, one of the most difficult steps in NLP programming.
At this stage, we have our parts of speech mapped as subjects, objects, verbs, and
more. But our model thus far only examines one sentence at a time, and additional
resolution is needed to match pronouns across sentences. For example, consider:
“Andrew is on a flight to Bali, an island in Indonesia. He is planning on living
there. It has a warm climate.”
Without coreference resolution, our model would understand that “he” and “it”
are the subjects of sentences two and three, but would not be able to connect
those pronouns to “Andrew” and “Bali,” respectively.
Due to its complicated nature, a more detailed explanation of coreference
resolution is beyond the scope of this article, but it can be read about in recent
research from Stanford University.
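To give a flavor of the problem, here is a deliberately naive heuristic: link each pronoun to the most recently mentioned compatible entity. The entity and pronoun tables are invented for this sketch, and real resolvers are far more sophisticated (this heuristic fails as soon as two people appear in a row):

```python
# Naive coreference: bind each pronoun to the most recent entity
# of a compatible kind. Tables are illustrative only.
ENTITIES = {"Andrew": "person", "Bali": "place"}
PRONOUNS = {"he": "person", "she": "person", "it": "place"}

def resolve(tokens):
    last_seen = {}   # entity kind -> most recent mention
    links = {}
    for t in tokens:
        if t in ENTITIES:
            last_seen[ENTITIES[t]] = t
        elif t.lower() in PRONOUNS:
            kind = PRONOUNS[t.lower()]
            if kind in last_seen:
                links[t] = last_seen[kind]
    return links

print(resolve("Andrew is on a flight to Bali . He is planning on living "
              "there . It has a warm climate".split()))
# {'He': 'Andrew', 'It': 'Bali'}
```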
spaCy
An open source dependency parser for Python known for its accuracy in
mapping syntax trees, spaCy is becoming increasingly relevant in NLP
development.
textacy
Textacy is a Python library built on the spaCy parser that implements a number of
common data extraction methods.
PorterStemmer
PorterStemmer reduces words to their root stems; for example, “connection,”
“connected,” and “connecting” all reduce to “connect.” (Collapsing “is,” “am,”
and “are” to “be” is the job of lemmatization, covered above, rather than
stemming.)
Summarizer
Summarizer uses trained algorithms to extract key information and ignore
what is less relevant. Tokenizer reduces words and punctuation to individual
tokens.