Natural Language Processing For Hackers
Learn to build awesome apps that can understand people
George-Bogdan Ivanov
ISBN 9781617296567
DISTRIBUTED BY MANNING
This book was created solely by its Author and is being distributed by Manning Publications “as
is,” with no guarantee of completeness, accuracy, timeliness, or other fitness for use. Neither
Manning nor the Author make any warranty regarding the results obtained from the use of the
contents herein and accept no liability for any decision or action taken in reliance on the
information in this book nor for any damages resulting from this work or its application.
Preface
Introduction
What is Natural Language Processing?
Challenges in Natural Language Processing
What makes this book different?
Persisting models
Fine Tuning
Try a different classifier
Use Ngrams Instead of Words
Using a Pipeline
Cross Validation
Grid Search
Picking the Best Tokenizer
Classification Metrics
Binary Classification
Multi-Class Metrics
The Confusion Matrix
Build a Chunker
IOB Tagging
Implementing the Chunk Parser
Chunker Feature Detection
MovieBot
The Movie DB API
Small-Talk Handlers
Simple Handlers
Execution Handlers
Today, written text is still the most popular method of communication. We read
stuff on the Internet, we write each other emails, send texts and we read books.
We’re very comfortable with this type of communication, and that’s one of the main
reasons chatbots are so popular right now. Being able to chat over the Internet is
not enough, though: we need to quickly interact with companies and access our
various online accounts. This book’s final chapter is indeed about building a chatbot,
but I consider the book to be much more than that. It is a practical introduction to
Natural Language Processing that touches on adjacent subjects such as Machine
Learning, Sentiment Analysis, data gathering, data cleaning and working with
various corpora.
• You’re a Python Engineer, comfortable with the language, and interested in this
whole Machine Learning/Natural Language Processing thing.
• You’re a researcher in Machine Learning/Natural Language Processing and are
interested in seeing a more practical facet of the field.
• You’re already an NLP Engineer and would like to level up.
Revision History
This book is updated regularly based on reader feedback. The plan is to create an
updated version every three or four months, depending on the quantity of feedback
and my availability. Anyone who purchased the book will have access to all further
updates of the book.
Credits
Let’s say you want to build an app to play Tic-Tac-Toe, but you don’t know how to
play this game. You can still build the app by giving it examples of previous games
and letting it learn from those examples. This approach is called Machine Learning
(ML) and it is, in simpler terms, a method of making a machine learn from
observations, gaining experience and improving its performance along the way.
Machine Learning is an absolutely huge and evolving field, under heavy research.
At this moment, its most popular subfield is Deep Learning, which studies how to
build models using large neural network architectures with several layers (hence
the attribute “deep”). Because it applies to so many problems, Deep Learning has
become a major driver of results we once thought were out of reach, smashing
barriers we previously believed unbreakable.
This book doesn’t require advanced Machine Learning knowledge, but some basic
information on this subject might come in handy. Either way, we will have a brief
introduction to Machine Learning, enough for you to be able to understand the
whole book, but not to get bored.
Now let’s say you also want to chat with the app while playing Tic-Tac-Toe. You will
have to make the machine learn how to understand natural (human) language.
Well, the solution for this would be Natural Language Processing (NLP), that
also goes by the name of Natural Language Understanding or Computational
Linguistics. NLP serves as a bridge between Human Language and Computer-
Generated Language and can be integrated with ML in order to achieve amazing
results.
The objective of NLP is to transform unstructured text data into knowledge (struc-
tured data) and use that in order to properly interact with humans. These days,
the most popular example is the chatbot. A chatbot is a conversational agent that
interacts directly with a human via a chat-like interface. The agent should feel
in all aspects human-like because, well, humans already know how to interact
with humans so the whole experience should be natural. The agent’s algorithm
transforms the user’s requests into actions, using internal resources or external
APIs to achieve desired goals, and communicates the results in a natural language
back to the human counterpart.
I like to think of grammar as the science of turning plain language into mathe-
matical objects. In fact, in the early days of NLP, almost all systems were Rule-Based.
• Machine Learning: Nowadays, almost every NLP process is built upon a Machine
Learning system.
• Language & Grammar: Understanding how a language works and general
grammar rules gets you a long way.
• General AI: Knowledge about Rule-Based Systems, Knowledge-Based Systems.
• Statistics: Building language models.
• Algebra: Matrix operations.
multiple meanings is: “I saw a man on a hill with a telescope”. When reading it
you probably instantly decided that it means: “There’s a man on a hill, and
I’m watching him with my telescope.”. But there are another four possible
meanings and I will leave you the pleasure of discovering them. (Answers:
http://www.byrdseed.com/ambiguous-sentences/1 )
• Emotions - Emotions are hard to define and even harder to identify. We can
think of them as complex states of feeling which are often intertwined with
personality and mood. One of the popular challenges in NLP is called Sentiment
Analysis and it implies detecting the polarity of a text (positive vs negative).
It has applications in analyzing huge numbers of product reviews, quantifying
online presence and so on. Here’s a head-scratcher:
“The best I can say about this product is that it was definitely interesting.”
In this case, the word “interesting” plays a different and more complex role than
the classical one.
I have an MSc degree in AI and, after working several years as a web developer,
I became intrigued by the idea of building real-world AI systems. As I studied this field,
I had to deal with irreproducible, incomplete and hard-to-grasp research papers.
Sure, there are a lot of ML and NLP books out there, but here’s what I hated about
most of them: they only present one part of the problem. This made me start
writing a blog, and after researching a bunch of sources, I managed to build
some complete, concise and compilable (as in runnable) code samples.
NLP for Hackers has emerged from the great response I had to my blog and it
contains several recipes on how to solve common problems regarding Natural
Language Processing.
This is not your typical research-oriented book that exposes the theoretical ap-
proach and uses clean datasets that you can only find in introductory courses and
never in the real world. This is a hands-on, practical course on getting started
with Natural Language Processing and learning its key concepts while coding. No
guesswork required.
Here’s where I think most books fall short:
• They don’t train useful models, like a part-of-speech tagger or a named entity
extractor (we’re going to do both here). The process of training this kind of
model remains a mystery for most people.
• They use perfect datasets (from academia) which are useless in the real
world. In the real world, data is not perfect.
• They almost never explain the data gathering process.
• The systems built usually remain stuck in a Python script or a Jupyter
notebook. They rarely get wrapped in an API that can be queried and plugged
into a real system.
Throughout the book, you’ll get to touch some of the most important and practical
areas of Natural Language Processing. Everything you do will have a working
result. Here are some things you will get to tackle:
The book contains complete code snippets and step-by-step examples. No need to fill
in the blanks or wonder what the author meant. Everything is written in concise,
easy-to-read Python3 code.
• Learning NLP - the API is simple & well-designed, full of good didactic exam-
ples
• Prototyping - before building performant systems, you should build a proto-
type so you can prove how cool your new app is.
PROS: Perfect for getting started with Natural Language Processing
CONS: Doesn’t come with super-powerful pretrained models (there are other frameworks out there that provide far better models)
Installing NLTK
As a reminder, this book requires basic Python programming experience and
familiarity with package management. That being said, the best way to get started
is to create a separate Python3 virtual environment and install NLTK inside it:
Install nltk
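A minimal sequence of shell commands for this setup (the environment name nlp-env is arbitrary):

$ python3 -m venv nlp-env
$ source nlp-env/bin/activate
$ pip install nltk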
After installing, you need to download all the NLTK data. That includes the various
corpora, grammars and pre-trained models. Open a Python console and run:
import nltk
nltk.download('all')
The best way to read this book is to write the code as you progress. You can either
create a simple Python3 file and write the code from every section, or if you have
experience, you can create Jupyter Notebooks for a more interactive experience.
Splitting Text
Until now, you probably haven’t put a lot of thought into how difficult the task of
splitting text into sentences and/or words can be. Let’s identify the most common issues
using a sample text from NLTK and see how the NLTK sentence splitter performs in
various situations.
from nltk.corpus import reuters

print(reuters.raw('test/21131')[:1000], '...')
Let’s pick a text from a news article that includes a few examples of the difficulties
one may encounter:
AMPLE SUPPLIES LIMIT U.S. STRIKE’S OIL PRICE IMPACT Ample supplies
of OPEC crude weighing on world markets helped limit and then reverse oil
price gains that followed the U.S. Strike on an Iranian oil platform in the Gulf
earlier on Monday, analysts said. December loading rose to 19.65 dlrs, up 45
cents before falling to around 19.05/15 later, unchanged from last Friday.
“Fundamentals are awful,” said Philip Lambert, analyst with stockbrokers
Kleinwort Grieveson, adding that total OPEC production in the first week of
October could be above 18.5 mln bpd, little changed from September levels.
Peter Nicol, analyst at Chase Manhattan Bank, said OPEC production could
be about 18.5-19.0 mln in October. Reuter and International Energy Agency
(IEA) estimates put OPEC September production at 18.5 mln bpd. The U.S.
Attack was in retaliation of last Friday’s hit of a Kuwaiti oil products tanker
flying the U.S. Flag, the Sea Isle City. …
I chose this particular text from a news article because it emphasizes how humans
split into sentences by using a lot of prior knowledge about the world. Taking it
even deeper, this prior knowledge can’t be seen as a massive library of information
because it is filtered by the author’s subjectivism. This being said, here are some
challenges you may face when splitting text into sentences:
• not all punctuation marks indicate the end of a sentence (from the example
above: “U.S.”, “19.65”, “19.05/15”, etc. )
• not all sentences end with a punctuation mark (e.g. text from social platforms)
• not all sentences start with a capitalized letter (e.g. some sentences start with
a quotation mark or with numbers)
• not all capitalized letters mark the start of a sentence (from the example above:
“The U.S. Attack”, “U.S. Flag”)
Now that we’ve identified some difficulties, let’s see how the NLTK sentence splitter
works and how well it performs by analyzing the results we get.
Introducing nltk.sent_tokenize
import nltk
from nltk.corpus import reuters
sentences = nltk.sent_tokenize(reuters.raw('test/21131')[:1000])
print("#sentences={0}\n\n".format(len(sentences)))
for sent in sentences:
    print(sent, '\n')
nltk.sent_tokenize results
#sentences=8
Reuter and
International Energy Agency (IEA) estimates put OPEC September
production at 18.5 mln bpd.
The U.S.
As you can see, the results are not perfect: the last three sentences are actually
a single sentence. The NLTK sentence splitter was fooled by the full stop and the
capitalized letter.
The NLTK splitter is a rule-based system that keeps lists of abbreviations, words that
usually go together and words that appear at the start of a sentence. Let me show
you an example where the NLTK splitter doesn’t fail:
import nltk

print(nltk.sent_tokenize("The U.S. Army is a good example."))
# ['The U.S. Army is a good example.'] - only one sentence, no false splits
Let’s talk about splitting sentences into words. The process is almost the same, but
the solution for word tokenization is easier and more straightforward.
Introducing nltk.word_tokenize
import nltk
• split standard contractions, e.g. “don’t” → “do n’t” and “they’ll” → “they ‘ll”
• treat most punctuation characters as separate tokens
• split off commas and single quotes, when followed by whitespace
• separate periods that appear at the end of line
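Here’s a minimal example showing these rules in action (the sentence is an arbitrary one of my own):

import nltk

# note how the contraction and the final period become separate tokens
print(nltk.word_tokenize("They'll say it doesn't matter."))
# ['They', "'ll", 'say', 'it', 'does', "n't", 'matter', '.']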
Building a vocabulary
You will often want to compute some word statistics. NLTK has some helpful classes
to quickly compute the metrics you’re after, like nltk.FreqDist. Here’s an example of
how you can use it:
Build a vocabulary
import nltk
fdist = nltk.FreqDist(nltk.corpus.reuters.words())
# get the words that only appear once (these words are called hapaxes)
print(fdist.hapaxes())
# Hapaxes usually are `mispeled` or weirdly `cApiTALIZED` words.
You’ll hear about bigrams and trigrams a lot in NLP. They are nothing but pairs or
triplets of adjacent words. If we generalize the term to longer sequences, we get
ngrams.
Ngrams are used to build approximate language models, but they are also used in
text classification tasks or as features for various other natural language statistical
models. Bigrams and trigrams are especially popular because going further
in size usually doesn’t bring a significant performance boost, just a more complex
model. Here are some shortcuts to work with:
import nltk
from nltk.collocations import BigramAssocMeasures, TrigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()
# look for collocations over the words of the Reuters corpus
finder = BigramCollocationFinder.from_words(nltk.corpus.reuters.words())
# return the 50 bigrams with the highest PMI (Pointwise Mutual Information)
print(finder.nbest(bigram_measures.pmi, 50))
# among the collocations we can find stuff like: (u'Corpus', u'Christi') ...
If you take a look at the results, you’ll notice that most of the collocations are
in fact real-world entities: people, companies, events, documents, laws, etc.
Part of Speech Tagging (or POS Tagging, for short) is probably the most popular chal-
lenge in the history of NLP. POS Tagging basically implies assigning a grammatical
label to every word in a sequence (usually a sentence). When I say grammatical
label, I mean: Noun, Verb, Preposition, Pronoun, etc.
In NLP, a collection of such labels is called a tag set. The most widespread one is:
Penn Treebank Tag Set3 . Below is the alphabetical list of part-of-speech tags used in
the Penn Treebank Project:
Tag Description
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
3 https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
Your instinct now might be to run to your high school grammar book but don’t
worry, you don’t really need to know what all those POS tags mean. In fact, not even
all corpora implement this exact tag set, but rather a subset of it. For example, I’ve
never encountered the LS (list item marker) or the PDT (predeterminer) anywhere.
Part-Of-Speech tagging also serves as a base of deeper NLP analyses. There are just
a few cases when you’ll work directly with the tagged sentence. A scenario that
comes to mind is keyword extraction, when usually you only want to extract the
adjectives and nouns. In this case, you use the tags to filter out those words that
can’t be a keyword.
If you’re not familiar with the task: nowadays, POS tagging is done with machine
learning models, but a while ago, POS taggers were rule-based. Using regular
expressions and various heuristics, the POS tagger would determine an appropriate
tag. Here’s an example of such a rule:
IF word in (I, you, he, her, it, we, them) THEN tag='PRONOUN'
It may seem like a rather massive oversimplification, but this is the general idea.
The challenge was to come up with rules that best described the phenomenon that is
human language. But as you go deeper and deeper, the rules become more and
more complex, and it gets harder to keep track of which rules did what and whether
some rules contradict one another. Moreover, humans don’t usually
excel at noticing correlations between a great number of variables. In this case,
the variables could have been:
• Previous words
• Following words
• Prefixes, Suffixes
• Previous POS tags
• Word capitalization
Mathematical algorithms are better at estimating which are the optimum rules
for correctly tagging words. Thus, the field turned to machine learning and the
approach became something like this:
1. Get some humans to annotate some texts with POS tags (we’ll call this the gold
standard)
2. Get other humans to build some mathematical models to predict tags using a
large part of the gold standard corpus (this is called training the model)
3. Using the remaining part of the gold standard corpus, assess how well the
model is performing on data the model hasn’t seen yet (this is called testing
the model)
This may seem complicated, and it usually is presented as such, but throughout this
book we’ll be demystifying all of these algorithms. To start doing POS tagging we
don’t need much because NLTK comes with some pre-trained POS tagger models.
It’s super easy to get started and here’s how:
Introducing nltk.pos_tag
import nltk

# tokenize an arbitrary example sentence, then tag it
tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit.")
tagged_tokens = nltk.pos_tag(tokens)
It is as straightforward and easy as that, but keep in mind that the NLTK pre-trained
model is not the best: it’s a bit slow and not the most precise. It is well suited, though,
for toy projects or prototyping.
Named Entity Recognition (NER for short) is almost as well-known and studied as
POS tagging. NER implies extracting named entities and their classes from a given
text. The usual named entities we’re dealing with stand for: People, Organizations,
Locations, Events, etc. Sometimes, things like currencies, numbers, percents, dates
and time expressions can be considered named entities even though they techni-
cally aren’t. The entities are used in information extraction tasks and usually, these
entities can be attributed to a real-life object or concept. To make things clearer, let
me give you some examples:
import nltk

# the sentence below is reconstructed from the output tree
sentence = ("The closing chapter, is adapted from the address that Feynman gave "
            "during the 1974 commencement exercises at the California Institute Of Technology.")
tagged_tokens = nltk.pos_tag(nltk.word_tokenize(sentence))
ner_annotated_tree = nltk.ne_chunk(tagged_tokens)
print(ner_annotated_tree)
# (S
# The/DT
# closing/NN
# chapter/NN
# ,/,
# is/VBZ
# adapted/VBN
# from/IN
# the/DT
# address/NN
# that/IN
# (PERSON Feynman/NNP)
# gave/VBD
# during/IN
# the/DT
# 1974/CD
# commencement/NN
# exercises/NNS
# at/IN
# the/DT
# (ORGANIZATION California/NNP Institute/NNP Of/IN Technology/NNP)
# ./.)
Notice that the result bundles one or more tokens into entities:
1. Feynman = PERSON
2. California Institute Of Technology = ORGANIZATION
Also notice that we needed to POS tag the sentence first before feeding it to the
ne_chunk function. This is because the function uses the POS tags as features that
contribute decisively to predicting whether something is an entity or not. The most
accessible examples are the tags NNP and NNPS. What do you think is the reason these
tags help the NE extractor find entities?
Let’s take a closer look at the ne_chunk function. Chunking means taking the tokens
of a sentence and grouping them together, in chunks. In this case, we grouped the
tokens belonging to the same named entity into a single chunk. The data structure
that facilitates this is nltk.Tree. The tokens of a chunk are all children of the same
node. All the other non-entity nodes and the chunk nodes are children of the same
root node: S. Here’s a quick way to visualize these trees:
Drawing a nltk.Tree
import nltk

ner_annotated_tree.draw()  # opens a window rendering the tree
The same things that we’ve said about the NLTK tagger are true for the NLTK Named
Entity Chunker: it’s not perfect.
In the next section we will explore another valuable resource that’s bundled up in
NLTK: Wordnet.
Wordnet Structure
You can think of Wordnet as a graph of concepts. The edges between the concepts
describe the relationship between them. A concept node is, in fact, a set of synonyms
representing the same concept, and is called a Synset. The synonyms within the
synset are represented by their Lemma. A Lemma is the base form (the dictionary
form) of the word.
Using this type of structure, we are able to build mechanisms that can reason. Let’s
take an example: I can say either “Bob drives the car” or “Bob drives the automobile”,
both being true because the nouns “car” and “automobile” are synonyms. Also, “Bob
drives the vehicle” can be true since all cars are vehicles.
NLTK provides one of the best interfaces for Wordnet. It is extremely easy to use,
and before we dive into it, try this exercise: write down all the meanings
of the word “car”.
4 https://wordnet.princeton.edu/
from nltk.corpus import wordnet as wn

for synset in wn.synsets('car'):
    print(synset.lemma_names())
    print(synset.definition())
# (earlier output omitted)
#
# ['car', 'elevator_car']
# where passengers ride up and down
#
# ['cable_car', 'car']
# a conveyance for passengers or freight on a cable railway
The most important relationships in Wordnet other than the synonymy relation-
ship are the hyponymy/hypernymy. Hyponyms are more specific concepts while
hypernyms are more general concepts.
Let’s build a concept tree. For readability, we won’t use all the concepts because
the result would be hard to visualize. I have conveniently chosen only 4 hyponyms of
the concept Vehicle and 5 more for its Wheeled-Vehicle hyponym (a sketch of how to
build such a tree follows the figure):
import nltk
from nltk.corpus import wordnet as wn
t.draw()
Concept Tree
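Since the snippet above is abbreviated, here is a minimal sketch of how one might build and draw such a tree; it takes every hyponym down to two levels instead of the hand-picked subset used for the figure:

import nltk
from nltk.corpus import wordnet as wn

def synset_tree(synset, depth=2):
    # recursively build an nltk.Tree of hyponyms, `depth` levels deep
    if depth == 0:
        return synset.name()
    children = [synset_tree(hyponym, depth - 1) for hyponym in synset.hyponyms()]
    return nltk.Tree(synset.name(), children) if children else synset.name()

t = synset_tree(wn.synset('vehicle.n.01'), depth=2)
t.draw()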
Maybe it’s not obvious from the previous queries, but synsets have an associated
part of speech. You can see it in the visualization on node labels: vehicle.n.01 where
n stands for noun. We can perform generic queries without specifying the part of
speech like this:
print(wn.synsets('fight'))
# [
# Synset('battle.n.01'),
# Synset('fight.n.02'),
# Synset('competitiveness.n.01'),
# Synset('fight.n.04'),
# Synset('fight.n.05'),
# Synset('contend.v.06'),
# Synset('fight.v.02'),
# Synset('fight.v.03'),
# Synset('crusade.v.01')
# ]
We can also perform particular queries, specifying the part of speech like this:
print(wn.synsets('fight', wn.NOUN))
# [
# Synset('battle.n.01'),
# Synset('fight.n.02'),
# Synset('competitiveness.n.01'),
# Synset('fight.n.04'),
# Synset('fight.n.05')
# ]
Moreover, we can perform a query for a very specific synset, like this one:
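A minimal sketch of such a query (the particular synset name is my own pick):

from nltk.corpus import wordnet as wn

# a synset is addressed by its full name: lemma.pos.sense_number
battle = wn.synset('battle.n.01')
print(battle)            # Synset('battle.n.01')
print(battle.definition())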
Lemma Operations
Until now, we’ve been looking into Wordnet synsets relationships, but lemmas have
some interesting properties as well. To begin with, the lemmas in a synset are sorted
by their frequency, like this:
Lemma operations
talk = wn.synset('talk.v.01')
print({lemma.name(): lemma.count() for lemma in talk.lemmas()})
# {'talk': 108, 'speak': 53}
There are fundamental differences between these two methods (stemming and
lemmatization), each having advantages and disadvantages. Here’s a short comparison:
• Both stemmers and lemmatizers bring inflected words to the same form, or at
least they try to
• Stemmers can’t guarantee a valid word as a result (in fact, the result usually
isn’t a valid word)
• Lemmatizers always return a valid word (the dictionary form)
• Stemmers are faster
• Lemmatizers depend on a corpus (basically a dictionary) to perform the task
When stemming, what you actually do is apply various rules to a specific word
form until the word is reduced to its basic form and no other rule can be applied.
This method is fast and also the best choice for most real-world applications. The
downside of stemming is the uncertainty of the result, because it usually isn’t a valid
word; on the bright side, two related words can resolve to the same stem. Let’s see
an example:
Snowball Stemmer
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
print(stemmer.stem('running'), stemmer.stem('runs'))  # run run
You can think of lemmatizers as a huge Python dictionary where the word forms
are the keys and the base form of the words are the values. Because two words
with different parts of speech can have the same form, we need to specify the part
of speech.
Wordnet Lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Resolve to `friend`
print(lemmatizer.lemmatize('friend', 'n'))
print(lemmatizer.lemmatize('friends', 'n'))
print(lemmatizer.lemmatize('slowly', 'r'))
• Rule Based Systems: given a set of rules, figure out which ones apply. The
process would look like this: apply the rules, get the results and by interpreting
them, take one decision or another.
• Case Based Reasoning: given a list of previous cases and the correct label, get
the most similar ones and figure out a way to use the labels of these cases to
come up with a label for a new given case.
• Knowledge Based Reasoning: use a graph-like taxonomy of the world to
perform reasoning (e.g. if all mammals give birth to live young & dolphins
are underwater animals & dolphins are mammals, then there are some under-
water animals that give birth to live young)
These are great approaches, but here’s the main flaw shared by all of them: they are only as
smart as the people programming them. They also have a limited view of the world
and a small amount of expressivity. Moreover, they lack flexibility and need
human input all along the way (e.g. someone has to notice if a rule doesn’t perform
well, someone has to figure out what’s wrong with it and fix it, and, furthermore,
someone has to make it either more general or more specific, or split it into two or
more rules).
Machine learning is basically a way of finding out what are the best rules for some
given data and also a way of updating the system when we get new information.
Let’s take an example to better understand this concept. Suppose we want to predict
the weather and, taking a classic approach, we build the following rules (a toy
implementation is sketched after the list):
• if it rained the previous day and today the sky is cloudy, then today it will
rain
• if today the humidity is above a threshold T and the sky is cloudy then today it
will rain
• otherwise it won’t rain today
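Here’s a minimal sketch of those three rules as code (the humidity threshold value is an arbitrary assumption):

# toy rule-based weather predictor implementing the rules above
HUMIDITY_THRESHOLD = 0.8

def will_it_rain(rained_yesterday, cloudy_today, humidity_today):
    if rained_yesterday and cloudy_today:
        return True
    if humidity_today > HUMIDITY_THRESHOLD and cloudy_today:
        return True
    return False

print(will_it_rain(rained_yesterday=True, cloudy_today=True, humidity_today=0.3))  # True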
Though I have used some common sense when creating these rules, they have no
real-world foundation. I mainly made them up. If I were to implement this system, it
might have certain predictive powers, thus presenting some intelligent behaviour.
Now let’s say that on one particular day the conditions for the first rule were
met (the previous day was rainy & today the sky is cloudy), but on that
particular day the temperature was also very low, so it didn’t rain and the prediction
was wrong. In a classical AI approach, we would manually correct the rule: one
needs to get into the system, identify which rule predicted the rain and figure
out what was the parameter that wasn’t taken into consideration (in this case, the
temperature). In addition, one has to figure out what would be a good threshold
for temperature (like humidity, in the second rule). As you can imagine, this can
become pretty unmanageable.
Taking the same example (predicting the weather), but from the machine learning
approach, the process would be totally different. First, we would sit back and relax
for a year and just collect weather data. Secondly, we would build a mathematical
model that minimizes prediction error because we now have data. Then we would
put the system online and keep monitoring the weather, feeding the system with
new data to better itself.
So, what’s the problem now? Well, machine learning systems have this one big issue:
they need data. It may not seem like an issue at first, but depending on what system
you want to build, collecting data may be hard. It may take a long time to gather it
and it can be really expensive. Think for example about the data needed to model
the effects of total solar eclipses on deep ocean fish migration patterns.
We, humans, perform very similar data collection and labelling processes all the
time without realizing it. As we age, the processes become extremely convoluted.
We take into account so many parameters that we can’t even keep track of. It’s
interesting to observe this on small kids because they haven’t experienced the
world for so long, so they haven’t yet perceived all of its facets. I know one cute
baby that used to be friendly with everyone. For that baby, everyone had a GOOD label
attached. One time, the baby got a bit sick so he had to be taken to the hospital. A
nurse that was wearing a white medical gown gave him a shot. This was his first
BAD label. Some days later, a neighbour wearing a white t-shirt visited the baby.
Instantly, the baby began crying. From the baby’s experience, people wearing white
are labelled as BAD.
Getting back to our weather example, such an application for predicting weather
will need to work in a similar way by learning from experience, in its case:
data. When we are talking about real-world systems, this gets even more difficult,
because real-world systems usually can’t experience the full spectrum of valid inputs:
the system might not encounter all possible temperature or humidity values,
hardware may malfunction and collect incorrect or incomplete data, there can be
errors in measurements, and so on.
Moving one step deeper: what happens when we need to predict the weather given
some meteorological conditions that do not match any of previously seen data?
Hard to say. Get the closest labelled data point and use the same label? Maybe.
Get the most common label of the 10 most similar cases? Possibly. Do something
different altogether? That’s definitely a possibility.
The process we’re trying to simulate here is called generalization, and the gist of it
is: we don’t need to know everything in the world to have a pretty good idea about
something. We can just fill in the blanks. We can make informed guesses about data
we have never seen before that is somewhat similar to data we have seen. This
is Machine Learning.
A while ago, I wrote a blog post about a practical Machine Learning
example. I will use the same example in this book because it fits very well. We
will get started with Machine Learning by building a predictor that is able to tell
whether a name is a girl name or a boy name. I think this example is one of the best
starting points, and here’s why:
• it works on text and is thus representative of this book’s subject
• it forces you to do actual feature extraction, rather than getting them from a
table
• it models a problem that is familiar rather than dealing with numbers or
abstract concepts
But before diving into the ML example, I want to continue the analogy with human
reasoning from the previous example (predicting the weather). Building the rules
for the weather prediction system, we took into consideration things like whether it
rained the previous day, whether the sky is cloudy today and the humidity level.
It is obvious that we’re not taking into consideration all the information we have.
We don’t care that 100 days ago it rained, 101 days ago it didn’t, and 102 days ago it
poured cats and dogs. Some information might be relevant, but we don’t use it. We
try to simplify the world we live in, aiming only to extract some of its features.
Now let’s dive into the machine learning problem, starting by examining the data a
bit:
Names Corpus
import collections
from nltk.corpus import names

girl_names = names.words('female.txt')
boy_names = names.words('male.txt')
# We know that there are a lot of girl names that end in `a`
# Let's see how many
girl_names_ending_in_a = [name for name in girl_names if name.endswith('a')]
print('#GirlNamesEndingInA=', len(girl_names_ending_in_a)) # 1773

# I'm not sure what's the most common last letter for English boy names
print(collections.Counter(name[-1] for name in boy_names).most_common(5))
print(collections.Counter(name[-1] for name in girl_names).most_common(5))
Here’s a problem: the 2nd and 3rd most common last letters are the same for both
genders. Let’s see how to deal with that.
First, we will build a predictive system and analyze its results. This type of system
is called a classifier. As you can guess, the role of a classifier is to classify data. The
idea is pretty basic: it takes some input and it assigns that input to a class. At first,
we are going to use the NLTK classifier because it is a little bit easier to use. After we
clarify the concepts, we will move to a more complex classifier from Scikit-Learn.
Build the first name classifier
import nltk
import random
from nltk.corpus import names
def extract_features(name):
    """
    Get the features used for name classification
    """
    return {
        'last_letter': name[-1]
    }

# Label every name in the corpus, then shuffle
# (the exact train/test split below is an assumption, not necessarily the one used in the book)
labelled_names = ([(name, 'girl') for name in names.words('female.txt')] +
                  [(name, 'boy') for name in names.words('male.txt')])
random.shuffle(labelled_names)

featuresets = [(extract_features(name), gender) for name, gender in labelled_names]
train_data, test_data = featuresets[1000:], featuresets[:1000]

name_classifier = nltk.DecisionTreeClassifier.train(train_data)

print(name_classifier.classify(extract_features('Gaga'))) # girl
print(name_classifier.classify(extract_features('Joey'))) # girl
# Sorry Joey :(
# Test how well it performs on the test data:
# Accuracy = correctly_labelled_samples / all_samples
print(nltk.classify.accuracy(name_classifier, test_data)) # 0.7420654911838791
At this point, in our example, we only have one feature (the last letter) and two classes
(boy/girl). It turns out that our experiment with the four names (Bono, Latiffa, Gaga
and Joey) was fairly representative, since one in four names got misclassified, which
is roughly what the 74% accuracy predicts.
Using the nltk.DecisionTreeClassifier class we’re effectively building a tree whose
nodes hold conditions that try to optimally classify the training data. Here’s what
ours looks like:
It may not seem like much of a tree at this point (it only has one level), but don’t
worry: by the time we’re done with it, it will be.
Let’s add another feature: the number of vowels in the name. To do this,
we need to change the extract_features function and retrain, as sketched below.
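A minimal sketch of the two-feature version (the shape of the vowel-count feature follows the extended version shown later in this chapter):

def extract_features(name):
    """
    Get the features used for name classification
    """
    return {
        'last_letter': name[-1],
        'vowel_count': len([c for c in name if c in 'AEIOUaeiou'])
    }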
After retraining we got a 0.77 accuracy (77%). That means we have a 3% boost. Let’s
see the new tree which now has two levels:
At this point, we can notice that more features are somehow better, allowing us to
capture different aspects of a given sample. Let’s add a few more and see how far
it takes us:
import string

def extract_features(name):
    """
    Get the features used for name classification
    """
    features = {
        # Last letter
        'last_letter': name[-1],
        # First letter
        'first_letter': name[0],
        # How many vowels
        'vowel_count': len([c for c in name if c in 'AEIOUaeiou'])
    }

    # Build letter presence and letter count features
    for c in string.ascii_lowercase:
        features['contains_' + c] = c in name
        features['count_' + c] = name.lower().count(c)

    return features
I got a 0.79 (79%) accuracy. Please be aware that the numbers may be slightly
different in your case and that’s due to the randomization of the training/test set.
Ok, now that we got comfortable with these concepts, let’s repeat this process using
Scikit-Learn models and see if we can do better.
• provides a greater diversity of models, whereas NLTK only offers implementations
for MaxentClassifier, DecisionTreeClassifier and NaiveBayesClassifier
• provides a consistent API for data preprocessing, vectorization and training
models
• has an extensively documented API and various tutorials around it
• is faster than most alternatives because it is built upon NumPy
• provides some parallelization features
About NumPy
NumPy is a scientific computing package with data structures optimized for
fast numeric operations.
Now let’s build the dataset using a useful sklearn utility: train_test_split.
def extract_features(name):
    """
    Get the features used for name classification
    """
    return {
        'last_letter': name[-1]
    }
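Here’s a minimal sketch of that step, assuming the same NLTK names corpus and the 'girl'/'boy' labels used earlier:

from nltk.corpus import names
from sklearn.model_selection import train_test_split

labelled_names = ([(name, 'girl') for name in names.words('female.txt')] +
                  [(name, 'boy') for name in names.words('male.txt')])

X = [extract_features(name) for name, _ in labelled_names]
y = [gender for _, gender in labelled_names]

# hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)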
We are going to rewrite the training process using the equivalent scikit-learn
models. Here are the main building blocks of the scikit-learn API:
There are two main types of scikit-learn components:
1. Transformers
2. Predictors
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

dict_vectorizer = DictVectorizer()
name_classifier = DecisionTreeClassifier()
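To make the transformer/predictor split concrete, here’s a minimal sketch of how these two objects plug together, assuming the X_train/X_test split built above:

# Transformers expose fit/transform: learn the feature space, then vectorize
X_train_vectorized = dict_vectorizer.fit_transform(X_train)
X_test_vectorized = dict_vectorizer.transform(X_test)

# Predictors expose fit/predict: train on the vectors, then classify new samples
name_classifier.fit(X_train_vectorized, y_train)
predictions = name_classifier.predict(X_test_vectorized)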
I encourage you to retrain the classifier using the extended feature extraction
function.
At this point, some concepts might seem ambiguous. For instance, what gets passed
between the DictVectorizer and the DecisionTreeClassifier? Well, Scikit-Learn models
only work with numeric matrices. The role of vectorizers is to convert data from
various formats to numeric matrices. If we take a look at the output type of a vectorizer,
we will get this: scipy.sparse.csr.csr_matrix. This is a sparse matrix, which is a
memory-efficient way of storing a matrix. Unfortunately, such a matrix is a bit hard
to read. However, we need to be able to work with it and to understand this data
format, and here’s how we can do that:
import numpy as np

vectorized = np.zeros(len(dict_vectorizer.feature_names_))
# Contains 3 `w`s
vectorized[dict_vectorizer.feature_names_.index('count_w')] = 3.0
vectorized[dict_vectorizer.feature_names_.index('count_b')] = 1.0
vectorized[dict_vectorizer.feature_names_.index('count_i')] = 1.0
print(vectorized)
# Let's see what we've built:
# [ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
# 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
# 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 3. 0. 0. 0. 0. 0.
# 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
# 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
# 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
# Let's now apply the inverse transformation, to get the feature dict
print(dict_vectorizer.inverse_transform([vectorized]))
# [{'contains_a': 1.0, 'contains_b': 1.0, 'contains_w': 1.0,
# 'count_a': 1.0, 'count_b': 1.0, 'count_w': 3.0, 'last_letter=a': 1.0}]
Making Predictions
As we covered in the Scikit-Learn API summary, predictions can be made using the
predict method. This is an important step in our process, but keep in mind that you
can call this method only after the model has been trained with the fit method.
Making predictions
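A minimal example, assuming the dict_vectorizer and name_classifier fitted in the previous sections:

test_names = ['Bono', 'Latiffa', 'Gaga', 'Joey']
features = [extract_features(name) for name in test_names]
print(name_classifier.predict(dict_vectorizer.transform(features)))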
Existing corpora
There is a limited number of existing corpora that we could use to achieve our goal.
As you can notice, the categories we got aren’t that helpful for our task. They’re
narrow in scope and don’t cover the entire news spectrum as we would like.
This corpus is mostly used in benchmarking tasks rather than in real-world
applications. We can do better by gathering our own data, but bear in mind that
building a corpus is usually a tedious and expensive endeavour. If you start from
scratch, you need to manually go through a lot of news articles and blog posts
and pick the most relevant category for each of them. Most of the tutorials on text
classification use bogus data, e.g. for doing sentiment analysis, or use an existing
corpus that will produce a useless model with no practical application. In the
following section, we will take another path and explore some other ideas for
gathering data.
The goal here is to build a small, general-purpose corpus. This task is complex, but it is
only adjacent to our main goal of building a news classifier, so we will take some
shortcuts. For practical reasons, we could look at these web resources:
• Project Gutenberg Categories6 – this can prove useful depending on the domain
you are trying to cover. You will need to write a script to download the books,
transform them into plain text and use them to train your classifier.
• Reddit – This is a good source of already categorized data. Pick a list of
subreddits, assign each of them to a category, crawl the subreddits and extract the
links. Consider all the articles extracted from a subreddit as belonging
to the same category. Here’s a list of all subreddits7
• Use the Bing Search API8 to get relevant articles for your categories
The general idea is to find places on the Internet where content is placed in prede-
fined buckets or categories. These buckets can then be assigned to a category from
your own taxonomy. Obviously, the process is quite error-prone. After gathering
the data, I suggest looking at some random samples and assessing the percentage
of it that’s correctly categorized. This will give you a rough idea of the quality of
your corpus.
Another trick I like to use after collecting the data is building a script that goes
through the labelled data and asks whether each sample is correctly classified (a minimal
sketch of such a script follows). Don’t stress too much about the interface; its only purpose
is to do the labelling. A command line interface that accepts y/n input, or even a
Tinder-like system, will do the trick.
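Here’s a hypothetical minimal labelling loop; the CSV filename and the column names are assumptions, so adjust them to match your own dump:

import pandas as pd

df = pd.read_csv('dataframe.csv')
keep = []
for _, row in df.iterrows():
    print('\n[{0}] {1}'.format(row['category'], row['title']))
    answer = input('Correctly classified? [y/n] ').strip().lower()
    keep.append(answer != 'n')

# keep only the rows confirmed as correct
df[keep].to_csv('dataframe_checked.csv', index=False)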
If the numbers allow it, you can then manually go through the samples that
aren’t correctly classified and fix them. It may seem like a lot of work, but keep
in mind that if it’s done right, it can save you a lot of time, especially given that the
alternative is to search for articles yourself and manually assign the appropriate
label.
Getting back to the task at hand, we’ll use a different web resource for building
our corpus: the web bookmarking service Pocket9 . This service offers an explore
feature10 that requires us to input clear, unambiguous queries in order to get well-curated results.
6 https://www.gutenberg.org/wiki/Category:Bookshelf
7 https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits
8 https://azure.microsoft.com/en-us/services/cognitive-services/bing-web-search-api/
9 https://getpocket.com
10 https://getpocket.com/explore/
• data is socially curated and of high quality: the articles suggested by Pocket
are bookmarked by a large number of users
• data is current: suggestions are frequently added to the service
• data is easy to gather: the explore feature can be easily crawled and, at the
moment, it doesn’t seem to block crawlers
If you want to skip the corpus creation step, I have already prepared the corpus and
you can download it from here: Text Classification Data11
If you want to get your hands dirty and do it anyway, here’s how we go about it.
First, let’s figure out which should be the categories and then proceed to collecting
the data.
# If you have specific needs for your corpus, remember to adjust these categories and keywords accordingly.
CATEGORIES = {
'business': [
"Business", "Marketing", "Management"
],
'family': [
"Family", "Children", "Parenting"
],
'politics': [
"Politics", "Presidential Elections",
"Politicians", "Government", "Congress"
],
'sport': [
"Baseball", "Basketball", "Running", "Sport",
"Skiing", "Gymnastics", "Tenis", "Football", "Soccer"
],
'health': [
"Health", "Weightloss", "Wellness", "Well being",
"Vitamins", "Healthy Food", "Healthy Diet"
],
'economics': [
"Economics", "Finance", "Accounting"
],
'celebrities': [
"Celebrities", "Showbiz"
],
'medical': [
"Medicine", "Doctors", "Health System",
"Surgery", "Genetics", "Hospital"
],
'science & technology': [
"Galaxy", "Physics",
"Technology", "Science"
],
'information technology': [
"Artificial Intelligence", "Search Engine",
11 https://www.dropbox.com/s/715w0myrjdmujox/TextClassificationData.zip?dl=0
• querying the service and scraping the article URLs from the page using beauti-
fulsoup412.
• iterating through the links and fetching the content of the articles using news-
paper3k13, a library that helps us extract only the main content of a webpage.
• saving everything, including the category, in a dataframe and dumping it in a CSV
file.
import uuid
import atexit
import urllib
import random
import requests
import pandas as pd
from time import sleep, time
from bs4 import BeautifulSoup
from newspaper import Article, ArticleException
POCKET_BASE_URL = 'https://getpocket.com/explore/%s'

# dataframe that accumulates the crawled articles
# (the exact column names are an assumption)
df = pd.DataFrame(columns=['category', 'title', 'url', 'content'])
@atexit.register
def save_dataframe():
""" Before exiting, make sure we save the dataframe to a CSV file """
dataframe_name = "dataframe_{0}.csv".format(time())
df.to_csv(dataframe_name, index=False)
# Shuffle the categories to make sure we are not exhaustively crawling only the first categories
categories = list(CATEGORIES.items())
random.shuffle(categories)
url = title_html.a['data-saveurl']
print("Indexing article: \"{0}\" from \"{1}\"".format(title, url))
try:
    article = Article(url)
    article.download()
    article.parse()
    content = article.text
except ArticleException as e:
    print("Encountered exception when parsing \"{0}\": \"{1}\"".format(url, str(e)))
    continue

if not content:
    print("Couldn't extract content from \"{0}\"".format(url))
    continue

12 https://pypi.python.org/pypi/beautifulsoup4
13 https://pypi.python.org/pypi/newspaper3k/0.2.2
This is an extensive process and it will take a while for everything to download. If
you can’t leave it running overnight, you can run the script several times and then
concatenate the results. Remember, if you want to get the data directly, or you want
to work on the exact same data as me, you can download the data from here: Text
Classification Data14
Here’s what the distribution of the samples included in the above link looks like:
14 https://www.dropbox.com/s/715w0myrjdmujox/TextClassificationData.zip?dl=0
How much wood does a woodchuck chuck if a woodchuck could chuck wood
This is how this sentence would look transformed into the feature space:
{
    'wood': 2,
    'a': 2,
    'woodchuck': 2,
    'chuck': 2,
    'how': 1,
    'much': 1,
    'does': 1,
    'if': 1,
    'could': 1
}
import collections
text = """
How much wood does a woodchuck chuck
if a woodchuck could chuck wood
"""
print(collections.Counter(text.lower().split()))
One thing that might throw you off at this point is that this method doesn’t take
into consideration the order of the words inside a text. This is one of the known
drawbacks of this method. Using word counts is an approximation and this type of
approximation is called Bag of Words.
There are methods that deal with actual sequences, such as Hidden Markov Models
or Recurrent Neural Networks, but these types of methods are not the subject
of this book.
Bag of Words models are really popular and widely used because they are sim-
ple and can perform pretty well. We can get even better approximations of the
sequence using bigram or trigram models.
As we discussed in the first chapters, a bigram is a pair of adjacent words inside a
sentence and a trigram is a triplet of such words. Here’s how to compute them for
a given text using nltk utils:
import nltk
import collections
from pprint import pprint
text = """
How much wood does a woodchuck chuck
if a woodchuck could chuck wood
"""
bigram_features = collections.Counter(
    list(nltk.bigrams(text.lower().split())))
trigram_features = collections.Counter(
    list(nltk.trigrams(text.lower().split())))
pprint(bigram_features)
pprint(trigram_features)
{
('a', 'woodchuck'): 2,
('how', 'much'): 1,
('much', 'wood'): 1,
('wood', 'does'): 1,
('does', 'a'): 1,
('woodchuck', 'chuck'): 1,
('chuck', 'if'): 1,
('if', 'a'): 1,
('woodchuck', 'could'): 1,
('could', 'chuck'): 1,
('chuck', 'wood'): 1
}
{
('how', 'much', 'wood'): 1,
('much', 'wood', 'does'): 1,
('wood', 'does', 'a'): 1,
('does', 'a', 'woodchuck'): 1,
('a', 'woodchuck', 'chuck'): 1,
('woodchuck', 'chuck', 'if'): 1,
('chuck', 'if', 'a'): 1,
('if', 'a', 'woodchuck'): 1,
('a', 'woodchuck', 'could'): 1,
('woodchuck', 'could', 'chuck'): 1,
('could', 'chuck', 'wood'): 1
}
One thing to keep in mind before we move forward: using bigrams and trigrams
makes the feature space much larger because of the combinatorial explosion of
possible word pairs and triplets.
Now that we’ve covered how a text gets transformed to features, we can move on to
how this is actually done in practice. Scikit-Learn has special vectorizers for dealing
with text that come in handy.
Scikit-Learn CountVectorizer
text = """
How much wood does a woodchuck chuck
if a woodchuck could chuck wood
"""
vectorizer = CountVectorizer(lowercase=True)
# (0, 0) 2
# (0, 1) 1
# (0, 2) 1
# (0, 3) 1
# (0, 4) 1
# (0, 5) 1
# (0, 6) 2
# (0, 7) 2
Analyzing the results, you will notice that the right column holds the word counts,
the same values we computed earlier with Counter (except that the default tokenizer
drops single-character tokens such as “a”), while the left column holds the indices
(sample_index, word_index_in_vocabulary).
Scikit-Learn works mainly with matrices and almost all components require a
matrix as input. The vectorizer transforms a list of texts (notice how both fit
and transform get a list of texts as input) into a matrix of size (sample_count, vocab-
ulary_size). The purpose of the fit method is to compute the vocabulary (thus the
vocabulary size) by computing how many different words we have. Let’s find out
what happens when we use words outside the vocabulary:
Scikit-Learn CountVectorizer
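A minimal example of such a query, continuing with the vectorizer fitted above:

# none of these words were seen during fit, so every count is 0
print(vectorizer.transform(['completely unseen words']))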
As you can notice, we get an empty matrix with all values set to 0. That means
none of the known features (words) have been detected and that’s why our print
statement doesn’t output anything.
Scikit-Learn MultinomialNB
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
# Load the crawled corpus
# (the CSV filename and the `text` column name are assumptions; adjust them to your own dump)
data = pd.read_csv('./text_analysis_data.csv')
text_samples = data['text'].as_matrix()
labels = data['category'].as_matrix()

# Split the data for training and for testing and shuffle it
# keep 20% for testing, and use the rest for training
# shuffling is important because the classes are not random in our dataset
X_train, X_test, y_train, y_test = train_test_split(
    text_samples, labels, test_size=0.2, shuffle=True)

vectorizer = CountVectorizer(lowercase=True)
classifier = MultinomialNB()

# Learn the vocabulary from the training texts, then train the classifier
classifier.fit(vectorizer.fit_transform(X_train), y_train)

# Accuracy = correctly labelled samples / all samples
print(classifier.score(vectorizer.transform(X_test), y_test))  # around 0.65
For this task, I have opted for the Naive Bayes classifier, as it’s one of the most
widely used and well-known types of classifiers for text. Moreover, even its documentation
states this:
Chances are, if you’ve ever gone through a text classification tutorial, you have used
a Naive Bayes classifier.
After training, we got an accuracy of around 0.65 (or 65%). The accuracy may vary
a bit because we shuffle the dataset. This means we don’t train the classifier on
the same data every time we run the script. Getting a 65% accuracy might not seem
good enough, and frankly, it probably isn’t if you expect people to pay for your text
analysis service. However, it’s not a bad number and here’s why:
• The dataset size was pretty small. It’s hard to say what an appropriate dataset
size for this task would be, but my recommendation would be at least a few tens
of thousands of samples.
• The dataset is not gold quality, meaning it wasn’t annotated by humans. If you
analyze the dataset, you’ll see misclassified samples (even if you only look at
titles). These misclassified samples can damage the overall performance of the
classifier.
• A 65% accuracy on a 2-class problem is different than a 65% accuracy on a 25-
class problem (which is our case). If we were to select a random class from the
25 classes for every single sample, we’d only achieve a 4% accuracy. But if we
select randomly from the 2-class problem, we would achieve 50%. Therefore,
the fact that we were able to achieve 65% on our problem doesn’t sound that
bad. It’s a good practice to compare its performance to a random classifier’s
performance.
The next step now is to actually classify a text. Let’s see how to do that:
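Here’s a minimal example, assuming the vectorizer and classifier trained above (the example text and the predicted label are illustrative only):

new_text = "The central bank kept interest rates unchanged, and markets rallied on the news."
print(classifier.predict(vectorizer.transform([new_text])))
# e.g. ['economics']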
1. Training the model given some labelled data. Usually, the trained model has to
satisfy some acceptance criteria. That is why this step is often only semi-automated,
as it requires a certain amount of human intervention, such as adjusting
parameters or the training set.
2. Storing and loading the model. We need a way to save the trained model to disk
because we don’t want to retrain the model every time the server is restarted.
Our purpose is to load the trained model from disk, so we can use it later when
calling the API.
3. Using the model. The analysis step is usually done through the API and is 100%
automated. No need for human intervention. We just use the trained model to
classify given texts.
We have already covered steps 1 and 3, now let’s see how we can do step 2:
import time
from sklearn.externals import joblib
timestamp = int(time.time())
# Save the vectorizer
joblib.dump(vectorizer, './text_analysis_vectorizer_%s.joblib' % timestamp)

# Save the classifier
joblib.dump(classifier, './text_analysis_classifier_%s.joblib' % timestamp)
joblib is part of the Python scientific computing ecosystem. Among others, this
module provides methods for persisting numpy objects (like the ones inside our
classifiers and vectorizers) in an efficient manner. An alternative would be using
the classic pickle package.
Let’s load the model and use it:
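A minimal sketch of that step, reusing the timestamped file names that appear in the API section below (the example text is illustrative):

from sklearn.externals import joblib

vectorizer = joblib.load('./text_analysis_vectorizer_1504879283.joblib')
classifier = joblib.load('./text_analysis_classifier_1504879283.joblib')

print(classifier.predict(vectorizer.transform(['The team won the championship game last night.'])))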
Flask is the usual choice for building a hassle-free API. Let’s dive straight in by
installing Flask (pip install Flask).
Now let’s create a file named api.py. This will hold our endpoint definition.
API code
from flask import Flask, request, jsonify
from sklearn.externals import joblib

app = Flask(__name__)

VECTORIZER_MODEL_PATH = './text_analysis_vectorizer_1504879283.joblib'
CLASSIFIER_MODEL_PATH = './text_analysis_classifier_1504879283.joblib'

# load the persisted models at startup
vectorizer = joblib.load(VECTORIZER_MODEL_PATH)
classifier = joblib.load(CLASSIFIER_MODEL_PATH)

@app.route("/classify", methods=['POST'])
def classify():
    prediction = classifier.predict(
        vectorizer.transform([request.data]))[0]
    return jsonify(prediction=prediction)
$ export FLASK_APP=api.py
$ python -m flask run
Now you can use your favourite REST client to POST raw text to the /classify
endpoint. I usually use Postman15 for this task.
If this wasn’t a book for the hacker in you, we would have stopped here. But we are
going to take things a step further and we’ll actually deploy this as an online service
using Heroku.
After finishing this chapter, you’ll have a text analysis endpoint online and believe
it or not, people actually pay money for this kind of service.
15 https://www.getpostman.com/
First things first: create or login into your Heroku account, and download the
Heroku CLI16 , then create a requirements.txt file:
requirements.txt file
click==6.7
Flask==0.12.2
gunicorn==19.7.1
itsdangerous==0.24
Jinja2==2.9.6
MarkupSafe==1.0
numpy==1.13.1
scikit-learn==0.19.0
scipy==0.19.1
Werkzeug==0.12.2
Heroku needs a Procfile to know which file contains the application
initialization:
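A minimal Procfile, assuming the Flask app object lives in api.py as above and is served with the gunicorn version pinned in requirements.txt:

Procfile

web: gunicorn api:app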
Before deploying to Heroku, you might also want to create a .gitignore file in order
to commit only relevant files:
.gitignore file
*.pyc
__pycache__
Deploy to Heroku
$ heroku login
$ git init
$ heroku create
$ git add .
$ git commit -m "Text classification API"
$ git push heroku master
If you followed all the steps correctly, you should get a Heroku application URL. I
got this one: https://guarded-basin-39818.herokuapp.com.
We’re ready now to test the online API:
• Positive / Negative
• Positive / Neutral / Negative
• Very-Positive / Positive / Neutral / Negative / Very-Negative
• Objective / Subjective
Sentiment Analysis is used in various applications, the most common being the
analysis of product reviews and the monitoring of brand perception online.
Due to its practical application, there are many more resources available for
Sentiment Analysis than for general text classification, and we are going to explore
a good deal of them in this chapter.
Be Aware of Negations
Let’s say you’re disappointed about a product you bought and when asked in a
survey about it, you leave the following comment:
Yes, sure … I’m going to buy ten more. Can’t wait! Going to the store right
now! NOT!
Well, you might have felt the irony, but computer algorithms and machine learning
models have difficulties in understanding humour and figurative speech.
Here’s a simpler example:
You can sense that “interesting” was used in this case with a negative connotation
(it suggests that the person was unpleasantly surprised) rather than in its classic
positive sense. The suspension points are an indicator of this subtlety. So, would be
fair to say that a ML model will interpret the comment in the same way if it had
this rule implemented (suspension points = slightly negative sense / unpleasantly
surprised)? Would this rule be accurate? Are the suspension points generally used
in this way? Well, no and even more, people tend to express humour and figurate
speech in many different ways.
What happens if a text expresses multiple feelings? What happens if the sentiments
expressed are for different objects (Multiple Sentiments) and what happens if they
are for the same object (Mixed Sentiments)? What’s the overall feeling of the text?
These are important questions to have in mind, especially when you consider the
practical side of Sentiment Analysis. Most of the reviews, comments, articles, etc.,
have more than one sentence and more than one sentiment expressed towards an
object or multiple objects.
Consider the following examples:
“The phone’s design is the best I’ve seen so far, but the battery can definitely
use some improvements”
“I know the burger is unhealthy as hell, but damn it feels good eating it!”
Which one expresses multiple sentiments and which one expresses mixed senti-
ments?
Non-Verbal Communication
Another area where computers still fall short is non-verbal communication. The
text we’re reading doesn’t contain any non-verbal cues. We, using experience,
usually attribute a certain tonality to a text. Have you ever ended up fighting with
somebody in a chat application or over email because you weren't correctly
interpreting each other's tone?
Can you imagine how complex a system would become if it had to take into account
all these factors?
Twitter Corpora
Twitter has been noticed in the academic environment for a while now, and various
corpora from tweets have been created. Here are some of them:
Besides the ones stated so far, there is a multitude of other corpora related to
sentiment analysis and I am going to list some of the most well-known:
Let’s start by gathering the Twitter data and putting it all together. We are going to
use three out of the six resources I have mentioned earlier in the Twitter Corpora
chapter: NLTK Twitter Samples, Twitter Airline Reviews and First GOP Debate
Twitter Sentiment.
import pandas as pd
import matplotlib.pyplot as plt
####################################
#
# NLTK Twitter Samples
#
####################################
from nltk.corpus import twitter_samples
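The rest of this step isn't shown in this excerpt; a minimal sketch of pulling the tweet strings out of the NLTK corpus (the fileids are the ones the twitter_samples corpus ships with):

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
print(len(positive_tweets), len(negative_tweets))  # 5000 5000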
For the next step, you need to download the Twitter Airline Reviews corpus from
Kaggle: https://www.kaggle.com/crowdflower/twitter-airline-sentiment28
26 http://sentiwordnet.isti.cnr.it/
27 https://code.google.com/archive/p/dataset/downloads
28 https://www.kaggle.com/crowdflower/twitter-airline-sentiment
####################################
#
# Twitter Airline Reviews
#
####################################
airline_tweets = pd.read_csv('./Tweets.csv')
Now let’s download the First GOP Debate Twitter Sentiment corpus from Kaggle and
process it as well: https://www.kaggle.com/crowdflower/first-gop-debate-twitter-
sentiment29
####################################
#
# First GOP Debate Twitter Sentiment
#
####################################
debate_tweets = pd.read_csv('./Sentiment.csv')
As a final step, let’s put everything together and see how many tweets we’ve got for
each polarity:
29 https://www.kaggle.com/crowdflower/first-gop-debate-twitter-sentiment
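The assembly code itself isn't shown here. A sketch of what it might look like, assuming the Kaggle files expose text/airline_sentiment and text/sentiment columns, reusing the positive_tweets/negative_tweets lists from above, and producing the tweet/sentiment CSV the rest of the chapter reads:

# continuing the script above
frames = [
    pd.DataFrame({'tweet': positive_tweets, 'sentiment': 'positive'}),
    pd.DataFrame({'tweet': negative_tweets, 'sentiment': 'negative'}),
    airline_tweets[['text', 'airline_sentiment']].rename(
        columns={'text': 'tweet', 'airline_sentiment': 'sentiment'}),
    debate_tweets[['text', 'sentiment']].rename(columns={'text': 'tweet'}),
]
all_tweets = pd.concat(frames, ignore_index=True)
# normalize the labels to lowercase: positive / neutral / negative
all_tweets['sentiment'] = all_tweets['sentiment'].str.lower()
print(all_tweets['sentiment'].value_counts())
all_tweets.to_csv('./twitter_sentiment_analysis.csv', index=False)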
Observations
We now have the data all together, so let’s train a model to predict tweet sentiments
by taking the same approach as we did in the Text Analysis chapter.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
data = pd.read_csv('./twitter_sentiment_analysis.csv')
tweets = data['tweet'].values.astype(str)
sentiments = data['sentiment'].values.astype(str)
# Split the data for training and for testing and shuffle it
X_train, X_test, y_train, y_test = train_test_split(tweets, sentiments,
test_size=0.2, shuffle=True)
vectorizer = CountVectorizer(lowercase=True)
classifier = MultinomialNB()
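The fitting and scoring lines aren't shown above; they follow the same pattern as in the Text Analysis chapter:

vectorizer.fit(X_train)
classifier.fit(vectorizer.transform(X_train), y_train)
print(classifier.score(vectorizer.transform(X_test), y_test))  # ~0.71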
I got an accuracy of around 0.71 (71%). It’s not bad, but there are some small
adjustments we can make to obtain better results. We’ll see how that goes in the
next chapters.
Better Tokenization
How does the Vectorizer transform text into words? According to the documentation,
the vectorizer has two parameters of interest: tokenizer (a callable that, if given,
overrides the built-in tokenization step) and token_pattern (the regular expression
used to split the text into tokens when no tokenizer is provided).
Let's take the default value of token_pattern, which is r"(?u)\b\w\w+\b", and see
how it performs on this tweet:
import re
import nltk

# the sample tweet used below (as reconstructed from the tokenizer output)
tweet = "@BeaMiller u didn't follow me :(("

print(re.findall(r"(?u)\b\w\w+\b", tweet))
# ['BeaMiller', 'didn', 'follow', 'me']

print(nltk.word_tokenize(tweet))
# ['@', 'BeaMiller', 'u', 'did', "n't", 'follow', 'me', ':', '(', '(']
We’ve tried both the regex token pattern and the nltk.word_tokenize, but we did
not get satisfactory results: none of them caught the emoticon (which is a huge
sentiment indicator) or the Twitter handle (which has no sentiment value).
Feel free to build a better performing regex, as it would be a good exercise at this
point.
Moving forward, let’s use another tokenizer that comes bundled with NLTK and see
how it performs: nltk.tokenize.TweetTokenizer
NLTK Tweet Tokenization
from nltk.tokenize.casual import TweetTokenizer
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(tweet))
# ['@BeaMiller', 'u', "didn't", 'follow', 'me', ':(', '(']
tokenizer = TweetTokenizer(strip_handles=True)
print(tokenizer.tokenize(tweet))
# ['u', "didn't", 'follow', 'me', ':(', '(']
As you can see, it’s not perfect, but it’s slightly better. Let’s try our new fancy
tokenizer for the sentiment analysis task and see where it takes us:
# ...
from nltk.tokenize.casual import TweetTokenizer
tweet_tokenizer = TweetTokenizer(strip_handles=True)
# ...
# ...
I got around 0.78 accuracy (78%)! That's a huge boost in accuracy for a very small
adjustment. Pretty cool :)!
One of the classifiers that is always worth trying out is LogisticRegression from
Scikit-Learn. It is very versatile and especially good with text. Its main advantage
is that, just like the Naive Bayes we've been experimenting with, it doesn't need
much parameter tweaking. The only change you need to make to the previous script to
try it out is:
# ...
from sklearn.linear_model import LogisticRegression
# ...
classifier = LogisticRegression()
# ...
The Scikit-Learn vectorizer API allows us to use ngrams rather than just words.
Remember what we’ve covered in the previous chapters about ngrams? It’s exactly
the same procedure. Instead of using only single-word features, we use consecutive,
multi-word features as well. Changes to the previous script to make this happen are
minimal:
# ...
vectorizer = CountVectorizer(lowercase=True,
tokenizer=tweet_tokenizer.tokenize,
ngram_range=(1, 3))
# ...
Using a Pipeline
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize.casual import TweetTokenizer
tweet_tokenizer = TweetTokenizer(strip_handles=True)
data = pd.read_csv('./twitter_sentiment_analysis.csv')
tweets = data['tweet'].values.astype(str)
sentiments = data['sentiment'].values.astype(str)
# Split the data for training and for testing and shuffle it
X_train, X_test, y_train, y_test = train_test_split(tweets, sentiments,
test_size=0.2, shuffle=True)
# assemble the vectorizer and the classifier into a single pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,
                                   tokenizer=tweet_tokenizer.tokenize,
                                   ngram_range=(1, 3))),
    ('classifier', LogisticRegression()),
])

pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
Cross Validation
This strategy might seem a bit harder to grasp, but bear with me and we'll get to the
bottom of it.
I stated earlier in the book that we need to keep the test data separate from the
train data, in order not to influence the classifier's output. Did you ever wonder
why that is? Well, if we tested the system on the same data we trained on, we would
obviously get awesome, but biased, results. In order to get valid results, we need to
test the system with data it hasn't seen yet.
If we continuously tweak the parameters to improve the results on the test set, we
indirectly overfit the system on the test dataset. That is undesirable because it
makes the system worse at generalizing: if we then test it on unseen data, outside
of the test set, it will underperform. One way to fix this problem is to keep some
more data aside and never touch it while we tune the parameters. This portion of the
data is called the Validation Set. After we're satisfied with the results on the test
set, then and only then do we use the Validation Set to check how our system is doing
on unseen data.
This approach has a huge drawback: Data is usually scarce, and we will be putting
even more data aside that’s not going to be used for training.
An approach for getting around this drawback would be doing Cross Validation.
This implies splitting the dataset into N folds. The system is trained N times, each
time on all the data except a different one of the N folds, and tested on the fold
that was left out.
At the end, the scores of all the runs are averaged. This way we don't waste much data.
Here's an example of Cross Validation with N = 5 folds:
import pandas as pd
from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from nltk.tokenize.casual import TweetTokenizer
tweet_tokenizer = TweetTokenizer(strip_handles=True)
data = pd.read_csv('./twitter_sentiment_analysis.csv')
tweets = data['tweet'].values.astype(str)
sentiments = data['sentiment'].values.astype(str)
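The rest of the listing isn't shown here; a sketch of the missing part, assuming the same vectorizer/classifier pipeline as in the previous section:

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,
                                   tokenizer=tweet_tokenizer.tokenize,
                                   ngram_range=(1, 3))),
    ('classifier', LogisticRegression()),
])

tweets, sentiments = shuffle(tweets, sentiments)
scores = cross_val_score(pipeline, tweets, sentiments, cv=5, scoring='accuracy')
print(scores, scores.mean())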
Using the Cross Validation strategy, we’re still around 0.82 in accuracy.
Grid Search
As we’ve seen so far, there are quite a few parameters we can tune to improve
accuracy, and we have not explored that many yet. Moreover, there’s no way to
know for sure what will be the effects of tuning a parameter in a certain way. There’s
no exact algorithm for tuning a model. Mastering this implies curiosity and lots of
practice.
However, here’s a simple way of optimizing parameter combinations called Grid
Search. This technique implies using Cross Validation for every possible parameter
combination. That’s a lot of work, so it will take a while.
import pandas as pd
from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize.casual import TweetTokenizer
tweet_tokenizer = TweetTokenizer(strip_handles=True)
data = pd.read_csv('./twitter_sentiment_analysis.csv')
tweets = data['tweet'].values.astype(str)
sentiments = data['sentiment'].values.astype(str)
classifier = GridSearchCV(pipeline, {
# try out different ngram ranges
'vectorizer__ngram_range': ((1, 2), (2, 3), (1, 3)),
# check if setting all non zero counts to 1 makes a difference
'vectorizer__binary': (True, False),
}, n_jobs=-1, verbose=True, error_score=0.0, cv=5)
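The listing leaves out the pipeline construction (the same CountVectorizer + LogisticRegression pipeline as before) and the final fitting step, which would be roughly:

classifier.fit(tweets, sentiments)
print(classifier.best_params_)
# e.g. {'vectorizer__binary': True, 'vectorizer__ngram_range': (1, 3)}
print(classifier.best_score_)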
The results of the grid search tell us that the best choice is to use word, bigram
and trigram features, and to use binary values instead of counts.
One of the first points we’ve covered in this book was stemmers. As you remember,
stemmers are good for standardizing words to a root form. Ideally, if two words
are derived forms of the same root, they should have the same stem. This trick can
significantly reduce the number of features, helping us build less complex and more
robust models.
Another thing worth improving is negation handling. As you will remember,
Sentiment Analysis performs poorly when negations are used. NLTK has a
utility function, mark_negation, that tries to solve this problem by marking all words
after a negation as negated until the end of the sentence or until another negation
is encountered.
We can now use GridSearchCV again, to decide what’s the best tokenization method
to choose, out of these four combinations:
from nltk.stem import PorterStemmer
from nltk.sentiment.util import mark_negation

stemmer = PorterStemmer()

def stemming_tokenizer(text):
    return [stemmer.stem(t) for t in tweet_tokenizer.tokenize(text)]
def tokenizer_negation_aware(text):
return mark_negation(tweet_tokenizer.tokenize(text))
def stemming_tokenizer_negation_aware(text):
return mark_negation(stemming_tokenizer(text))
tweet = "@rebeccalowrie No, it's not just you. Twitter have decided to take that feature away :-("
print(tweet_tokenizer.tokenize(tweet))
# ['No', ',', "it's", 'not', 'just', 'you', '.', 'Twitter',
# 'have', 'decided', 'to', 'take', 'that', 'feature', 'away', ':-(']
print(stemming_tokenizer(tweet))
# ['No', ',', "it'", 'not', 'just', 'you', '.', 'twitter',
# 'have', 'decid', 'to', 'take', 'that', 'featur', 'away', ':-(']
print(tokenizer_negation_aware(tweet))
# ['No', ',', "it's", 'not', 'just_NEG', 'you_NEG', '.', 'Twitter',
# 'have', 'decided', 'to', 'take', 'that', 'feature', 'away', ':-(']
print(stemming_tokenizer_negation_aware(tweet))
# ['No', ',', "it'", 'not', 'just_NEG', 'you_NEG', '.', 'twitter',
# 'have', 'decid', 'to', 'take', 'that', 'featur', 'away', ':-(']
classifier = GridSearchCV(pipeline, {
'vectorizer__tokenizer': (
tweet_tokenizer.tokenize,
stemming_tokenizer,
tokenizer_negation_aware,
stemming_tokenizer_negation_aware,
)
}, n_jobs=-1, verbose=True, error_score=0.0, cv=5)
We got a small boost by using a stemmer, up to 0.82 accuracy, while all the other
options performed poorly. The model we have now is good enough. We should save it to
disk and reuse it in the next steps:
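The saving step isn't shown in this excerpt. Since the Tweepy listener below loads './twitter_sentiment.joblib', something along these lines would produce that file, refitting the winning configuration on all the data before persisting it:

from sklearn.externals import joblib

pipeline.set_params(**classifier.best_params_)
pipeline.fit(tweets, sentiments)
joblib.dump(pipeline, './twitter_sentiment.joblib')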
If you don’t already have a Twitter app, go ahead and create one here: Twitter
Apps32 . For the purpose of this project, don’t set a callback_url, we won’t need it since
we’re not building a web application. Instead of getting a callback, we’ll be directed
to a page with a code we need to copy/paste in the console. After creating your app,
go to the Keys and Access Tokens tab and save your authentication credentials. We’ll
need them in the next part.
Let’s build a Tweepy Stream Listener that classifies tweets and saves the count for
each category. From time to time, the listener will report the insights it has collected:
Building a Tweepy Listener
import tweepy
import webbrowser
from collections import Counter
from sklearn.externals import joblib
class SentimentAnalysisStreamListener(tweepy.StreamListener):
def __init__(self, model_path):
self.counts = Counter()
self.sentiment_classifier = joblib.load(model_path)
super(SentimentAnalysisStreamListener, self).__init__()
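The callback that does the actual work isn't shown above; a minimal sketch of an on_status handler, assuming the persisted model is a full pipeline that predicts straight from raw text:

    def on_status(self, status):
        # classify the incoming tweet and update the per-sentiment counts
        sentiment = self.sentiment_classifier.predict([status.text])[0]
        self.counts[sentiment] += 1
        # report the insights collected so far, every 100 tweets
        if sum(self.counts.values()) % 100 == 0:
            print(self.counts)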
30 https://dev.twitter.com/streaming/overview
31 http://tweepy.readthedocs.io/en/v3.5.0/index.html
32 https://apps.twitter.com
if __name__ == "__main__":
auth = tweepy.OAuthHandler('YOUR_TOKEN', 'YOUR_SECRET')
redirect_url = auth.get_authorization_url()
webbrowser.open_new(redirect_url)
verifier = input('Verifier:')
auth.get_access_token(verifier)
listener = SentimentAnalysisStreamListener(
model_path='./twitter_sentiment.joblib')
# this registers our Listener as a handler for tweets
stream = tweepy.Stream(auth=auth, listener=listener)
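The listing stops just before the call that actually starts consuming tweets; with Tweepy that's the stream's filter method, here tracking the keyword used as the example below:

    stream.filter(track=['Trump'])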
At the time of writing, there are a lot of tweets on the Trump subject, which makes
it a prolific example for a simple Twitter filter. Here's how the stats look:
Accuracy = CorrectlyClassifiedSamples / TotalSamples
Computing Accuracy
import numpy as np
print((PREDICTED_CLASSES == CORRECT_CLASSES).mean())
# 0.714285714286
Binary Classification
Let’s get back to our tweet sentiment classification task. For simplification we’re
going to reduce the problem to only two classes: positive and negative. Let’s train a
classifier and compute its accuracy:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

data = pd.read_csv('./twitter_sentiment_analysis.csv')
# filter out the `neutrals`
data = data[(data['sentiment'].isin(['positive', 'negative']))]
tweets = data['tweet'].values.astype(str)
sentiments = data['sentiment'].values.astype(str)
# Split the data for training and for testing and shuffle it
X_train, X_test, y_train, y_test = train_test_split(tweets, sentiments,
test_size=0.2, shuffle=True)
vectorizer = CountVectorizer(lowercase=True)
vectorizer.fit(X_train)
X_train_vectorized = vectorizer.transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)
Let’s elaborate on the following example: say that we have a disease called DiseaseX
which affects only 0.1% of the population: 1 in every 1000 individuals suffers from
DiseaseX. We lock ourselves in a laboratory and after a few days of intense Machine
Learning work, we build a very accurate classifier for predicting the presence of the
DiseaseX. Our predictive system has an accuracy of 99.9%! The secret behind such
a high-performing system is: always predict the absence of the disease.
Even though our system looks super performant in terms of accuracy, in fact it doesn't
do anything. The system never predicts the disease, so it won't actually help
anybody. Thus, we need a different performance measure. Let's introduce the
following notations:
• TP (true positives): sick people correctly predicted as sick
• FP (false positives): healthy people wrongly predicted as sick
• TN (true negatives): healthy people correctly predicted as healthy
• FN (false negatives): sick people wrongly predicted as healthy
Now, because accuracy has failed us, we want a measure that's better suited for
this kind of situation: one that doesn't depend on the size of the whole dataset
the way accuracy does. Ask yourself this question: what is the probability of a
positive sample being detected by our system? Since our system is of the
always-predict-the-absence-of-the-disease type, that probability is ZERO. This
measure is called Recall and has the following formula:
Recall = TP / (TP + FN)
Our system has great accuracy, but extremely low recall. If we made a 180-degree
change and created a system that always predicts the presence of the disease,
we would have 0.1% accuracy, but 100% recall!
Now ask yourself another question: what’s the probability of actually having the
disease, if we predicted that the sample has the disease? This measure is called
precision and has the following formula:
Precision = TP / (TP + FP)
In this case, the precision of the system is 0.1%. Obviously, this scenario is not ideal
either.
In order to avoid optimizing one or the other, we can use the F1-Score which is the
harmonic mean of the Recall and Precision:
F1-Score = 2 / (1 / Precision + 1 / Recall)
Let's go back to our Twitter Sentiment Analysis and see how well we did by looking
at these new measures. Sentiment Analysis is usually applied in customer support,
where we are more interested in catching complaints than in responding to praise.
For this reason, the positive class should be represented by the negative samples:
from sklearn.metrics import recall_score, precision_score, f1_score, classification_report

print(recall_score(
y_test, classifier.predict(X_test_vectorized), pos_label='negative'))
# 0.945172564439
print(precision_score(
y_test, classifier.predict(X_test_vectorized), pos_label='negative'))
# 0.828768435166
print(f1_score(
y_test, classifier.predict(X_test_vectorized), pos_label='negative'))
# 0.883151341974
Classification Report
print(classification_report(y_test, classifier.predict(X_test_vectorized)))
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('./twitter_sentiment_analysis.csv')
tweets = data['tweet'].values.astype(str)
sentiments = data['sentiment'].values.astype(str)
# Split the data for training and for testing and shuffle it
X_train, X_test, y_train, y_test = train_test_split(tweets, sentiments,
test_size=0.2, shuffle=True)
vectorizer = CountVectorizer(lowercase=True)
vectorizer.fit(X_train)
X_train_vectorized = vectorizer.transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)
For multi-class problems, precision, recall and F1 need an averaging strategy. The
most common options in Scikit-Learn are:
• Averaging (macro): compute the metric independently for each class, then average
the per-class results
• Weighted Averaging (weighted): average the per-class results, weighting each class
by its number of samples
print(recall_score(
y_test, classifier.predict(X_test_vectorized), average='weighted'))
# 0.698818642087
print(precision_score(
y_test, classifier.predict(X_test_vectorized), average='weighted'))
# 0.687808952468
print(f1_score(
y_test, classifier.predict(X_test_vectorized), average='weighted'))
# 0.670033645784
print(classification_report(
y_test, classifier.predict(X_test_vectorized)))
When investigating poor classification results, the confusion matrix is a great tool
to use. Let’s see how to build one for our Twitter multi-class example:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, classifier.predict(X_test_vectorized),
                      labels=['positive', 'neutral', 'negative'])
print(cm)

sns.heatmap(cm, square=True, annot=True, cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show(block=True)
In the results above, cm[i][j] represents the number of items (or the proportion of
items) that belong to class i but were classified as class j.
In conclusion:
• Ideally, we should see large values only on the main diagonal of the confusion
matrix (class i gets classified as i).
• If there are other large values in the matrix, we should investigate, because it
might mean that our system is often confusing those two classes.
Part-Of-Speech Corpora
To achieve our goal of building a Part of Speech Tagger, we are going to use
the Universal Dependencies Dataset33 . You can find complete information on this
ambitious project on their website34 , and I recommend reading a bit about the
project before moving forward.
Let’s start by downloading the two data files: “en-ud-train.conllu”, “en-ud-dev.conllu”,
and place them somewhere handy.
Have a look at these data files. You will notice how the sentences are separated by
a double newline and how each row contains annotations separated by tabs.
Ok, now that we have the data, our first task is to write a function that goes through
the data:
import os
from nltk import conlltags2tree

def read_ud_pos_data(filename):
    """
    Iterate through the Universal Dependencies Corpus Part-Of-Speech data
    Yield sentences one by one, don't load all the data in memory
    """
    current_sentence = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            # ignore comments
            if line.startswith('#'):
                continue
            # an empty line marks the end of a sentence
            if not line:
                if current_sentence:
                    yield current_sentence
                current_sentence = []
                continue
            annotations = line.split('\t')
            # keep the word form and its Penn-Treebank-style tag
            # (columns 2 and 5 of the CoNLL-U format)
            current_sentence.append((annotations[1], annotations[4]))
33 https://github.com/UniversalDependencies/UD_English
34 http://universaldependencies.org/
Instead of aiming to build the best model first, we are going to start with a
simple one. The plan for this task is as follows:
1. Train a trigram tagger: When looking for a word w3, if we have already
encountered a trigram of form: w1, w2, w3, with computed tags t1, t2, t3, we will
output tag t3 for word w3. Otherwise, we fallback on the bigram tagger.
2. Train a bigram tagger: When looking for a word w2, if we have already encoun-
tered a bigram of form: w1, w2, with computed tags t1, t2, we will output tag t2
for word w2. Otherwise, we fallback on the unigram tagger.
3. Train a unigram tagger: When looking for a word w1, if we have already
encountered that word and computed its tags t1, we will output tag t1 for word
w1. Otherwise, we fallback on a default choice.
4. Implement a default choice: Since all the above methods failed, we can output
the most common tag in the dataset.
Starting from the default choice, we’ll compute the most common tag and build the
taggers:
import nltk
import time
from collections import Counter
from utils import read_ud_pos_data

# load the training and development data (the two .conllu files downloaded earlier)
train_data = list(read_ud_pos_data('../../../data/en-ud-train.conllu'))
test_data = list(read_ud_pos_data('../../../data/en-ud-dev.conllu'))

# count every tag in the training data to find the most common one
tag_counter = Counter(tag for sentence in train_data for _, tag in sentence)
most_common_tag = tag_counter.most_common()[0][0]
print("Most Common Tag is: ", most_common_tag) # NN
start_time = time.time()
print("Starting training ...")
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_data, backoff=t0)
t2 = nltk.BigramTagger(train_data, backoff=t1)
t3 = nltk.TrigramTagger(train_data, backoff=t2)
end_time = time.time()
print("Training complete. Time={0:.2f}s".format(end_time - start_time))
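The evaluation line isn't shown above; with NLTK's tagger API it boils down to:

print(t3.evaluate(test_data))  # ~0.83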
As you can notice, we’ve obtained a 0.83 (83%) accuracy. We’ll consider this value
as the baseline for the statistical models we’re going to build next.
Going back to the name gender classifier we built in earlier chapters, you will
remember that the first step was to engineer the features to extract from the
names. Similarly, for this task, we will start by extracting features from the word
itself, the previous word and the one following it.
One of the most powerful features for POS tagging is the shape of the word.
Describing a word’s shape can include:
import re
def shape(word):
if re.match('[0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+$', word):
return 'number'
elif re.match('\W+$', word):
return 'punct'
elif re.match('[A-Z][a-z]+$', word):
return 'capitalized'
elif re.match('[A-Z]+$', word):
return 'uppercase'
elif re.match('[a-z]+$', word):
return 'lowercase'
elif re.match('[A-Z][a-z]+[A-Z][a-z]+[A-Za-z]*$', word):
return 'camelcase'
elif re.match('[A-Za-z]+$', word):
return 'mixedcase'
elif re.match('__.+__$', word):
return 'wildcard'
elif re.match('[A-Za-z0-9]+\.$', word):
return 'ending-dot'
elif re.match('[A-Za-z0-9]+\.[A-Za-z0-9\.]+\.$', word):
return 'abbreviation'
elif re.match('[A-Za-z0-9]+\-[A-Za-z0-9\-]+.*$', word):
return 'contains-hyphen'
return 'other'
Since NLTK has some pretty good abstractions available, we’re going to follow
mostly its structure for defining our base classes. NLTK contains a wrapper class
for Scikit-Learn classifiers in nltk.classify.SklearnClassifier, but we’re not going
to use it. That’s because we want to write it ourselves, mostly for demonstrative
purposes, but also because later on, we’ll need some extra functionalities that
the NLTK implementation doesn’t expose. Let’s write our own SklearnClassifier
wrapper, called ScikitClassifier:
class ScikitClassifier(nltk.ClassifierI):
"""
Wrapper over a scikit-learn classifier
"""
def __init__(self, classifier=None, vectorizer=None, model=None):
if model is None:
if vectorizer is None:
vectorizer = DictVectorizer(sparse=False)
if classifier is None:
classifier = LogisticRegression()
self.model = Pipeline([
('vectorizer', vectorizer),
('classifier', classifier)
])
else:
self.model = model
    @property
    def vectorizer(self):
        return self.model.steps[0][1]

    @property
    def classifier(self):
        return self.model.steps[1][1]
def labels(self):
return list(self.model.steps[1][1].classes_)
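The training and prediction methods the rest of the code relies on (classifier.train(...) and classify_many(...)) didn't make it into this excerpt; they presumably just delegate to the underlying pipeline, along these lines:

    def train(self, featuresets, labels):
        # fit the DictVectorizer + classifier pipeline on dict featuresets
        self.model.fit(featuresets, labels)

    def classify_many(self, featuresets):
        return list(self.model.predict(featuresets))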
Now, we need to write a feature extraction function for a given word in a sentence.
Here's how to do it:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def pos_features(sentence, index, history):
    """ sentence = list of words, index = the word we extract features for,
        history = the tags predicted so far """
    # Pad the sentence so that looking two words back or one word ahead never
    # goes out of bounds, and shift the index accordingly
    sentence = ['__START2__', '__START1__'] + list(sentence) + ['__END1__']
    index += 2
    # We will be looking two words back in history, so need to make sure we do not go out of bounds
    history = ['__START2__', '__START1__'] + list(history)
return {
# Intrinsic features
'word': sentence[index],
'stem': stemmer.stem(sentence[index]),
'shape': shape(sentence[index]),
# Suffixes
'suffix-1': sentence[index][-1],
'suffix-2': sentence[index][-2:],
'suffix-3': sentence[index][-3:],
# Context
'prev-word': sentence[index - 1],
'prev-stem': stemmer.stem(sentence[index - 1]),
'prev-prev-word': sentence[index - 2],
'prev-prev-stem': stemmer.stem(sentence[index - 2]),
'next-word': sentence[index + 1],
# Historical features
'prev-pos': history[-1],
'prev-prev-pos': history[-2],
# Composite
'prev-word+word': sentence[index - 1].lower() + '+' + sentence[index],
}
The function slides over the sentence, extracting features for every word and taking
into account the tags already assigned (the history parameter).
from sklearn.svm import LinearSVC

def train_scikit_classifier(dataset):
"""
dataset = list of tuples: [({feature1: value1, ...}, label), ...]
"""
# split the dataset into featuresets and the predicted labels
featuresets, labels = zip(*dataset)
classifier = ScikitClassifier(classifier=LinearSVC())
classifier.train(featuresets, labels)
return classifier
I've chosen LinearSVC (Linear Support Vector Classifier) as the classifier implemen-
tation for this example because it performs very well on this sort of task.
We are now ready to put everything together. Keep in mind that we are going to
use only the first 2000 sentences from the dataset, because otherwise training would
take an unbearable amount of time. Don't worry though, we'll come back to
this later.
import time
from nltk.tag import ClassifierBasedTagger
from utils import read_ud_pos_data
from tag import pos_features
if __name__ == "__main__":
print("Loading data ...")
train_data = list(read_ud_pos_data('../../../data/en-ud-train.conllu'))
test_data = list(read_ud_pos_data('../../../data/en-ud-dev.conllu'))
print("train_data", train_data)
print("Data loaded .")
start_time = time.time()
print("Starting training ...")
tagger = ClassifierBasedTagger(
feature_detector=pos_features,
train=train_data[:2000],
classifier_builder=train_scikit_classifier,
)
end_time = time.time()
print("Training complete. Time={0:.2f}s".format(end_time - start_time))
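The evaluation step isn't shown; using NLTK's standard tagger API it would be:

print(tagger.evaluate(test_data))  # ~0.89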
Congratulations, you have just trained your first statistical Part of Speech tagger.
An important thing to note here is that even though we used only 2000 sentences
for training, we managed to do significantly better (0.89 accuracy) than using the
Ngram tagger (0.83 accuracy) trained on the full dataset of 12543 sentences.
Out-Of-Core Learning
We got some pretty good results, but we can do even better. The issue with the
previous approach is that the data is too large to fit entirely in RAM. In order
to tackle this issue, we can divide the data into batches and train the classifier
iteratively on each batch.
To implement this approach, we will need to:
• Use a classifier type that can be trained online: This technique is called Out-Of-
Core Learning and implies presenting chunks of data to the classifier, several
times, rather than presenting the whole dataset at once. There are only a few
Scikit-Learn classifiers that have this functionality, mainly the ones implement-
ing the partial_fit method. We will use sklearn.linear_model.Perceptron.
• Use a vectorizer able to accommodate new features as they are “discovered”:
This method is called the Hashing Trick. Since we won't keep the entire
dataset in memory, it will be impossible to know all the features from the
start. Instead of computing the feature space beforehand, we create a fixed-size
space and assign each feature to a slot of that space using a hashing
function. For this purpose, we will use the sklearn.feature_extraction.FeatureHasher
from Scikit-Learn. The Scikit-Learn vectorizers we've studied so far cannot be
altered after calling the fit method. (A short standalone sketch of both pieces
follows this list.)
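To make the two ideas concrete, here's a tiny standalone sketch (toy features and labels, not the real tagger code) showing how FeatureHasher and Perceptron.partial_fit play together:

from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import Perceptron

# a fixed-size feature space: every feature is hashed into one of 2**18 slots
hasher = FeatureHasher(n_features=2 ** 18, input_type='dict')
classifier = Perceptron()

all_labels = ['JJ', 'NNP']  # partial_fit needs the full label list up front
batches = [
    ([{'word': 'nice', 'shape': 'lowercase'}], ['JJ']),
    ([{'word': 'London', 'shape': 'capitalized'}], ['NNP']),
]
for featuresets, labels in batches:
    classifier.partial_fit(hasher.transform(featuresets), labels, classes=all_labels)

print(classifier.predict(hasher.transform([{'word': 'Paris', 'shape': 'capitalized'}])))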
To accommodate all these changes, we first need to extend the functionality of the
ClassifierBasedTagger class. We'll extract the routine that slides the feature extraction
function over the sentence into a separate method. Rather than building the dataset in
the _train function, we'll need to build the dataset from batches inside the classifier_-
builder function. To achieve that, we need to pass the dataset creation function to
the builder. This might sound a bit intricate, but the code is fairly simple:
Extending ClassifierBasedTagger
from nltk import ClassifierBasedTagger
from nltk.metrics import accuracy
class ClassifierBasedTaggerBatchTrained(ClassifierBasedTagger):
def _todataset(self, tagged_sentences):
classifier_corpus = []
for sentence in tagged_sentences:
history = []
untagged_sentence, tags = zip(*sentence)
for index in range(len(sentence)):
featureset = self.feature_detector(untagged_sentence,
index, history)
classifier_corpus.append((featureset, tags[index]))
history.append(tags[index])
return classifier_corpus
if verbose:
print('Constructing training corpus for classifier.')
import time
import itertools
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import Perceptron
from classify import ScikitClassifier
from classify import ClassifierBasedTaggerBatchTrained
from tag import pos_features
from utils import read_ud_pos_data
def incremental_train_scikit_classifier(
sentences,
feature_detector,
batch_size,
max_iterations):
ALL_LABELS = list(ALL_LABELS)
for _ in range(max_iterations):
while True:
batch = list(itertools.islice(current_corpus_iterator, batch_size))
if not batch:
break
batch_count += 1
print("Training on batch={0}".format(batch_count))
dataset = feature_detector(batch)
return scikit_classifier
Notice how we never call the fit function on the FeatureHasher; we just assign it a
fixed size from the start.
We also added some extra params to our incremental training function: batch_size
(how many sentences go into each batch) and max_iterations (how many passes we make
over the whole dataset).
At this point, we need to pass these values to the function via a closure, since the
_train function doesn't pass them. Without further ado, here's how to train your
dragon. Ah, oops, your online learning system:
start_time = time.time()
print("Starting training ...")
tagger = ClassifierBasedTaggerBatchTrained(
feature_detector=pos_features,
train=read_ud_pos_data('../../../data/en-ud-train.conllu'),
classifier_builder=lambda iterator, detector: incremental_train_scikit_classifier(
iterator, detector, batch_size=500, max_iterations=30),
)
end_time = time.time()
print("Training complete. Time={0:.2f}s".format(end_time - start_time))
print(tagger.tag("This is a test".split()))
I know it takes a while, but don’t worry, it is supposed to. And yes, we got another
significant boost: 0.93 accuracy!
You have successfully managed to create a complex NLP artefact: a customized Part
of Speech Tagger. Congratulations!
There's also regular parsing, called deep parsing, which produces an entire
syntax tree as its result. As you can probably guess, deep parsing is a much more
complex task, more prone to errors and much slower than simple chunking. Which tool
best solves a given task really depends on the situation and should be decided on a
case-by-case basis. Let's see what each type of parsing looks like:
(S
(NP The/DT quick/JJ brown/JJ fox/NN)
(VP jumps/VBZ
(PP over/IN
(NP the/DT lazy/JJ dog/NN)))
(. .))
This type of parsing was created using Stanford Parser. You can check out their
Online Demo35 .
35 http://nlp.stanford.edu:8080/parser/index.jsp
(S
(NP The/DT quick/JJ brown/NN fox/NN)
(VP jumps/VBZ)
(PP over/IN)
(NP the/DT lazy/JJ dog/NN)
./.)
Using nltk.Tree, we can visualize what the chunking of this sentence looks like:
Chunk tree
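If you want to reproduce the figure yourself, one way is to parse the bracketed representation above into an nltk.Tree and call its draw method (a sketch):

from nltk import Tree

chunk_tree = Tree.fromstring("""
(S
  (NP The/DT quick/JJ brown/NN fox/NN)
  (VP jumps/VBZ)
  (PP over/IN)
  (NP the/DT lazy/JJ dog/NN)
  ./.)
""")
chunk_tree.draw()  # opens a window with the chunk tree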
IOB Tagging
In the IOB format, each token gets one of three kinds of tags: B-{type} marks the
beginning of a chunk, I-{type} marks a token inside a chunk, and O marks that we're
outside of any type of chunk. Here's how our sample sentence looks when presented in
this form:
IOB tagged sentence
[
('The', 'DT', 'B-NP'),
('quick', 'JJ', 'I-NP'),
('brown', 'NN', 'I-NP'),
('fox', 'NN', 'I-NP'),
('jumps', 'VBZ', 'B-VP'),
('over', 'IN', 'B-PP'),
('the', 'DT', 'B-NP'),
('lazy', 'JJ', 'I-NP'),
('dog', 'NN', 'I-NP'),
('.', '.', 'O')
]
Since NLTK comes with a chunking corpus, the CoNLL-2000 dataset, we don't have any
excuse not to explore it:
Exploring the CoNLL Dataset
from nltk.corpus import conll2000
print(len(conll2000.chunked_sents())) # 10948
print(len(conll2000.chunked_words())) # 166433
chunked_sentence = conll2000.chunked_sents()[0]
print(chunked_sentence)
# (S
# (NP Confidence/NN)
# (PP in/IN)
# (NP the/DT pound/NN)
# (VP is/VBZ widely/RB expected/VBN to/TO take/VB)
# (NP another/DT sharp/JJ dive/NN)
# if/IN
# (NP trade/NN figures/NNS)
# (PP for/IN)
# (NP September/NNP)
# ,/,
# due/JJ
# (PP for/IN)
# (NP release/NN)
# (NP tomorrow/NN)
# ,/,
# (VP fail/VB to/TO show/VB)
# (NP a/DT substantial/JJ improvement/NN)
# (PP from/IN)
# (NP July/NNP and/CC August/NNP)
# (NP 's/POS near-record/JJ deficits/NNS)
# ./.)
As you can notice, NLTK's default way of representing shallow parses is the
nltk.Tree. Here's how we can transform the chunked sentence (represented as an
nltk.Tree) into IOB-tagged triplets:
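The conversion code isn't shown here, but NLTK's tree2conlltags utility (used again later in this chapter) does exactly this:

from nltk.chunk import tree2conlltags

iob_triplets = tree2conlltags(chunked_sentence)
print(iob_triplets)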
# [
# ('Confidence', 'NN', 'B-NP'),
# ('in', 'IN', 'B-PP'),
# ('the', 'DT', 'B-NP'),
# ('pound', 'NN', 'I-NP'),
# ('is', 'VBZ', 'B-VP'),
# ('widely', 'RB', 'I-VP'),
# ('expected', 'VBN', 'I-VP'),
# ('to', 'TO', 'I-VP'),
# ('take', 'VB', 'I-VP'),
# ('another', 'DT', 'B-NP'),
# ('sharp', 'JJ', 'I-NP'),
# ('dive', 'NN', 'I-NP'),
# ('if', 'IN', 'O'),
# ('trade', 'NN', 'B-NP'),
# ('figures', 'NNS', 'I-NP'),
# ('for', 'IN', 'B-PP'),
# ('September', 'NNP', 'B-NP'),
# (',', ',', 'O'),
# ('due', 'JJ', 'O'),
# ('for', 'IN', 'B-PP'),
# ('release', 'NN', 'B-NP'),
# ('tomorrow', 'NN', 'B-NP'),
# (',', ',', 'O'),
# ('fail', 'VB', 'B-VP'),
# ('to', 'TO', 'I-VP'),
# ('show', 'VB', 'I-VP'),
# ('a', 'DT', 'B-NP'),
# ('substantial', 'JJ', 'I-NP'),
# ('improvement', 'NN', 'I-NP'),
# ('from', 'IN', 'B-PP'),
# ('July', 'NNP', 'B-NP'),
# ('and', 'CC', 'I-NP'),
# ('August', 'NNP', 'I-NP'),
# ("'s", 'POS', 'B-NP'),
# ('near-record', 'JJ', 'I-NP'),
# ('deficits', 'NNS', 'I-NP'),
# ('.', '.', 'O')
# ]
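Going in the opposite direction, and presumably what produces the tree printed below, conlltags2tree rebuilds the nltk.Tree from the IOB triplets:

from nltk import conlltags2tree

print(conlltags2tree(iob_triplets))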
# (S
# (NP Confidence/NN)
# (PP in/IN)
# (NP the/DT pound/NN)
# (VP is/VBZ widely/RB expected/VBN to/TO take/VB)
# (NP another/DT sharp/JJ dive/NN)
# if/IN
# (NP trade/NN figures/NNS)
# (PP for/IN)
# (NP September/NNP)
# ,/,
# due/JJ
# (PP for/IN)
# (NP release/NN)
# (NP tomorrow/NN)
# ,/,
# (VP fail/VB to/TO show/VB)
# (NP a/DT substantial/JJ improvement/NN)
# (PP from/IN)
# (NP July/NNP and/CC August/NNP)
# (NP 's/POS near-record/JJ deficits/NNS)
# ./.)
The base class for implementing chunkers in NLTK is called nltk.ChunkParserI. Let’s
extend that class and wrap inside it the ClassifierBasedTaggerBatchTrained we built in
the previous chapter.
ClassifierBasedChunkParser Class
def triplets2tagged_pairs(iob_sent):
"""
Transform the triplets to tagged pairs:
[(word1, pos1, iob1), (word2, pos2, iob2), ...] ->
[((word1, pos1), iob1), ((word2, pos2), iob2),...]
"""
return [((word, pos), chunk) for word, pos, chunk in iob_sent]
def tagged_pairs2triplets(iob_sent):
"""
Transform the tagged pairs back to triplets:
[((word1, pos1), iob1), ((word2, pos2), iob2),...] ->
[(word1, pos1, iob1), (word2, pos2, iob2), ...]
"""
return [(word, pos, chunk) for (word, pos), chunk in iob_sent]
class ClassifierBasedChunkParser(ChunkParserI):
    def __init__(self, chunked_sents, feature_detector, classifier_builder, **kwargs):
        # Transform the trees in IOB annotated sentences [(word, pos, chunk), ...]
        chunked_sents = [tree2conlltags(sent) for sent in chunked_sents]
        # Convert (word, pos, iob) triplets to tagged tuples ((word, pos), iob)
        chunked_sents = [triplets2tagged_pairs(sent) for sent in chunked_sents]
        self.feature_detector = feature_detector
        self.tagger = ClassifierBasedTaggerBatchTrained(
            train=(sent for sent in chunked_sents),
            feature_detector=self.feature_detector,
            classifier_builder=classifier_builder
        )

    def parse(self, tagged_sent):
        chunks = self.tagger.tag(tagged_sent)
        # Transform the result back to triplets and then to an nltk.Tree
        iob_triplets = tagged_pairs2triplets(chunks)
        return conlltags2tree(iob_triplets)

    def evaluate(self, chunked_sents):
        # Apply the same transformations as in __init__ to the gold data
        chunked_sents = [tree2conlltags(sent) for sent in chunked_sents]
        chunked_sents = [triplets2tagged_pairs(sent) for sent in chunked_sents]
        dataset = self.tagger._todataset(chunked_sents)
        featuresets, tags = zip(*dataset)
        predicted_tags = self.tagger.classifier().classify_many(featuresets)
        return accuracy(tags, predicted_tags)
Before finishing, we also need to implement the feature detection function. Let’s
see how:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def chunk_features(tokens, index, history):
    """ tokens = POS-tagged sentence [(word, pos), ...], index = current token,
        history = the IOB chunk tags predicted so far """
    # Pad the sequence (and the history) so that looking two tokens back or two
    # tokens ahead never goes out of bounds, then shift the index
    tokens = [('__START2__', '__START2__'), ('__START1__', '__START1__')] + \
        list(tokens) + [('__END1__', '__END1__'), ('__END2__', '__END2__')]
    history = ['__START2__', '__START1__'] + list(history)
    index += 2

    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    prevprevword, prevprevpos = tokens[index - 2]
    nextword, nextpos = tokens[index + 1]
    nextnextword, nextnextpos = tokens[index + 2]
    return {
'word': word,
'lemma': stemmer.stem(word),
'pos': pos,
'next-word': nextword,
'next-pos': nextpos,
'next-next-word': nextnextword,
'nextnextpos': nextnextpos,
'prev-word': prevword,
'prev-pos': prevpos,
'prev-prev-word': prevprevword,
'prev-prev-pos': prevprevpos,
# Historical features
'prev-chunk': history[-1],
'prev-prev-chunk': history[-2],
}
Now all that’s left is putting everything together in the exact same way we did with
the tagger previously:
Train chunker
import random
from nltk import pos_tag, word_tokenize
from nltk.corpus import conll2000
# (the chunker class, feature function and incremental training helper come from
#  the modules defined earlier in the chapter)

if __name__ == "__main__":
# Prepare the training and the test set
conll_sents = list(conll2000.chunked_sents())
random.shuffle(conll_sents)
train_sents = conll_sents[:int(len(conll_sents) * 0.9)]
test_sents = conll_sents[int(len(conll_sents) * 0.9 + 1):]
print("Training Classifier")
classifier_chunker = ClassifierBasedChunkParser(
train_sents,
chunk_features,
lambda iterator, detector: incremental_train_scikit_classifier(iterator, detector, 1000, 4),
)
print("Classifier Trained")
print(classifier_chunker.evaluate(test_sents))
print(classifier_chunker.parse(
pos_tag(word_tokenize("The quick brown fox jumps over the lazy dog."))))
# (S
# (NP The/DT quick/JJ brown/NN fox/NN)
# (VP jumps/VBZ)
# (PP over/IN)
# (NP the/DT lazy/JJ dog/NN)
# ./.)
We’ve got a pretty good value for the accuracy of our new chunker: 0.93.
Conclusions
from nltk import ne_chunk, pos_tag

print(ne_chunk(pos_tag(
"Twitter Inc. is based in San Francisco , California , United States , "
"and has more than 25 offices around the world .".split())))
# (S
# (PERSON Twitter/NNP)
# (ORGANIZATION Inc./NNP)
# is/VBZ
# based/VBN
# in/IN
# (GPE San/NNP Francisco/NNP)
# ,/,
# (GPE California/NNP)
# ,/,
# (GPE United/NNP States/NNPS)
# ,/,
# and/CC
# has/VBZ
# more/JJR
# than/IN
# 25/CD
# offices/NNS
# around/IN
# the/DT
# world/NN
# ./.)
NER Corpora
In order to train our own Named Entity Recognizer, we need a suitable corpus.
There is some data available in NLTK, like ConLL, IEER, that we could use for this
purpose. Let’s give it a go:
As you can notice, the NLTK data is not that satisfying: ConLL is language-indepen-
dent, while IEER is lacking POS tags. Let's give it another try with the Groningen
Meaning Bank36 , but before we start, I recommend downloading the data37 first and
putting it somewhere accessible.
The Groningen Meaning Bank is not a gold standard corpus, meaning it's not
manually annotated and it's not 100% correct. GMB is, however, significantly larger
than the corpora we've worked with so far.
Let’s start by opening a .tags file from the corpus and observe its format. As you can
notice, the tags are not in standard format:
• Instead of using the standard IOB convention, they use the NE tag: O, PER, PER,
O instead of O, B-PER, I-PER, O
• Some tags are composed: {TAG}-{SUBTAG}. For example, per-giv stands for person
- given name.
We need to standardize these tags: use only IOB format, consider only main tags,
and not subtags.
Here’s a list of the main tags and their meaning:
36 http://gmb.let.rug.nl/
37 http://gmb.let.rug.nl/data.php
One important thing to take into consideration is that some tags are very under-
represented. Here are the NE tags counts from this corpus:
Counter({u'O': 1146068, u'geo': 58388, u'org': 48094, u'per': 44254, u'tim': 34789, u'gpe': 20680, u'art': 867, u'eve': 709, u'nat': 300})
We’ll filter out the art, eve and nat tags due to this reason.
Here’s how you can read the corpus:
def ner2conlliob(annotated_sentence):
"""
Transform the pseudo NER annotated sentence to proper IOB annotation
Example:
[(word1, pos1, O), (word2, pos2, PERSON), (word3, pos3, PERSON),
(word4, pos4, O), (word5, pos5, O), (word6, pos6, LOCATION)]
transforms to:
if ner != 'O':
# If it's the first NE is also the first word
if idx == 0:
ner = "B-{0}".format(ner)
current_file += 1
# Skip files until we get to the start_index
if start_index is not None and current_file < start_index:
continue
ner_triplets = []
for row in rows:
annotations = row.split('\t')
word, tag, ner = annotations[0], annotations[1], annotations[3]
iob_triplets = ner2conlliob(ner_triplets)
# Yield a nltk.Tree
yield conlltags2tree(iob_triplets)
print("Total files=", current_file)
You may notice that I've added some extra parameters that we don't necessarily
use yet, such as start_index and end_index. Their role is to let us keep some files
aside for testing. At this point, there are exactly 10,000 files in the corpus (from 0
to 9999).
Feature Detection
NE feature detection
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def ner_features(tokens, index, history):
    """ tokens = POS-tagged sentence [(word, pos), ...], index = current token,
        history = the IOB NE tags predicted so far """
    # Same padding trick as in the chunker features
    tokens = [('__START2__', '__START2__'), ('__START1__', '__START1__')] + \
        list(tokens) + [('__END1__', '__END1__'), ('__END2__', '__END2__')]
    history = ['__START2__', '__START1__'] + list(history)
    index += 2

    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    prevprevword, prevprevpos = tokens[index - 2]
    nextword, nextpos = tokens[index + 1]
    nextnextword, nextnextpos = tokens[index + 2]
    previob, prevpreviob = history[-1], history[-2]
return {
'word': word,
'lemma': stemmer.stem(word),
'pos': pos,
'shape': shape(word),
'next-word': nextword,
'next-pos': nextpos,
'next-lemma': stemmer.stem(nextword),
'next-shape': shape(nextword),
'next-next-word': nextnextword,
'next-next-pos': nextnextpos,
'next-next-lemma': stemmer.stem(nextnextword),
'next-next-shape': shape(nextnextword),
'prev-word': prevword,
'prev-pos': prevpos,
'prev-lemma': stemmer.stem(prevword),
'prev-iob': previob,
'prev-shape': shape(prevword),
'prev-prev-word': prevprevword,
'prev-prev-pos': prevprevpos,
'prev-prev-lemma': stemmer.stem(prevprevword),
'prev-prev-iob': prevpreviob,
'prev-prev-shape': shape(prevprevword),
}
NER Training
The training routine we’re going to use is exactly the one we’ve used for the
chunker, and here’s how it looks:
Training NER
if __name__ == "__main__":
# Prepare the training and the test set
conll_sents = list(conll2000.chunked_sents())
random.shuffle(conll_sents)
train_sents = conll_sents[:int(len(conll_sents) * 0.9)]
test_sents = conll_sents[int(len(conll_sents) * 0.9 + 1):]
print("Training Classifier")
classifier_chunker = ClassifierBasedChunkParser(
train_sents,
ner_features,
lambda iterator, detector: incremental_train_scikit_classifier(iterator, detector, 500, 2),
)
print("Classifier Trained")
print(classifier_chunker.evaluate(test_sents))
print(classifier_chunker.parse(
pos_tag(word_tokenize(
"Twitter Inc. is based in San Francisco , California , United States , "
"and has more than 25 offices around the world ."))))
I obtained a 0.975 accuracy. That’s pretty awesome. Here’s the result of applying the
NER on our sample sentence:
(S
(org Twitter/NNP Inc./NNP)
is/VBZ
based/VBN
in/IN
(geo San/NNP Francisco/NNP)
,/,
(geo California/NNP)
,/,
(geo United/NNP States/NNPS)
,/,
and/CC
has/VBZ
more/JJR
than/IN
25/CD
offices/NNS
around/IN
the/DT
world/NN
./.)
Conclusions
• All nodes have a single parent, called the head. This effectively means that all
nodes have exactly one inbound edge.
• A special ROOT node, added to the sentence for uniformity, has no head and thus
no inbound edge.
• A node can have any number of dependents (including none), each represented
through an edge to that dependent.
• Edges have labels, which is why this is called a typed dependency parse. Here's
the Stanford Parser guide to the various types of dependencies39 .
• The edges do not cross. This is true for most languages and is called a projective
dependency parse. The opposite is called a non-projective dependency parse and
shows up in languages like German, Dutch and Czech.
As we already mentioned, a node can have only one head, but it can have any
number of dependents. That's because the head is the main word of a phrase (noun
phrase, verb phrase, etc.) in the sentence. The dependents provide further details
about the head. In the example provided above, the root node is “jumps”. Here's how
to think about heads and dependents:
38 http://nlp.stanford.edu:8080/corenlp/
39 http://ufal.mff.cuni.cz/~hladka/2015/docs/Stanford-dependencies-manual.pdf
1. “jumps”
2. nsubj: Who jumps? → “fox jumps”
3. det: Which fox jumps? → “The fox jumps”
4. amod: What kind of fox jumps? → “The quick brown fox jumps”
You’ll need to read what every dependency relation stands for, but here’s a quick
reference of the three we’ve mentioned so far:
Until recently, dependency parsers were a very sought-after resource. Corpora
were scarce and there weren't many (if any) viable, ready-built dependency parsers
to use in a project. Things have changed since then, and today we have quite a few
options available:
One way of looking at the dependency parsing problem is to try to reduce it to a
classification problem. This is a valid approach, even if at first glance it doesn't
look like a traditional classification problem. The solution will tell us the head of
every word in the sentence. Since each node has exactly one inbound edge coming from
its head, we need to identify, for each node in the sentence, which node that inbound
edge comes from. Therefore, we might consider this more of a selection problem than a
classification one.
We could compute the probability of each node being the head of every other
node and pick the optimal combination. However, the result is not guaranteed to
have a valid structure that respects all the rules listed above. We somehow need to
force our algorithm to only give us valid results: without crossing edges, and so on.
Leaving the edge labels aside, and only focusing on identifying the source and
destination of the edges, here is how Transition-based Dependency Parsing works.
The parser maintains a stack of partially processed words, a buffer holding the words
not yet processed, and the list of arcs (edges) drawn so far. At every step it applies
one of three transitions (as the step-by-step example below illustrates):
• SHIFT: move the word at the front of the buffer onto the stack
• LEFT-ARC: draw an arc from the word at the front of the buffer (the head) to the
word on top of the stack (the dependent), and pop the stack
• RIGHT-ARC: draw an arc from the second word on the stack (the head) to the word
on top of the stack (the dependent), and pop the stack
Let’s take a really simple example: I buy green apples. Now let’s apply the operations
to get the dependency parse, step-by-step:
Step 0
stack = ["I"]
buffer = ["buy", "green", "apples"]
arcs = []
Step 1 - LEFT-ARC
stack = []
buffer = ["buy", "green", "apples"]
arcs = [("buy", "I")]
Step 2 - SHIFT
stack = ["buy"]
buffer = ["green", "apples"]
arcs = [("buy", "I")]
Step 3 - SHIFT
stack = ["buy", "green"]
buffer = ["apples"]
arcs = [("buy", "I")]
Step 4 - LEFT-ARC
stack = ["buy"]
buffer = ["apples"]
arcs = [("buy", "I"), ("apples", "green")]
Step 5 - SHIFT
stack = ["buy", "apples"]
buffer = []
arcs = [("buy", "I"), ("apples", "green")]
Step 6 - RIGHT-ARC
stack = ["buy"]
buffer = []
arcs = [("buy", "I"), ("apples", "green"), ("buy", "apples")]
Step 7 - LEFT-ARC (draw an arc from the ROOT node to the remaining node)
stack = []
buffer = []
arcs = [("buy", "I"), ("apples", "green"), ("buy", "apples"), (ROOT, "buy")]
The solution to the dependency parse is: [LEFT-ARC, SHIFT, SHIFT, LEFT-ARC, SHIFT,
RIGHT-ARC, LEFT-ARC].
It may seem pretty straightforward at this point, but there’s a catch: for a given
dependency tree, more than one solution is possible and there’s really no way of
knowing which one is the best one.
Now that we have a method for building the tree, it is easy to see that this problem
can be turned into a classification one. Since we have a dependency tree dataset,
Universal Dependencies one, we now need to convert it to a transitions dataset
using the steps described above. The classifier we will train should be able to tell
us which transition (SHIFT, LEFT-ARC or RIGHT-ARC) we should apply at each step, given
a current stack and buffer.
Because for a given tree there can be more than one solution, we can’t build the
transitions dataset, at least not yet. At this point, we really need a way to decide if
one transition sequence is better than another. After all, we aim to use the one with
the highest score to build this dataset.
• DependencyParser class: The parser class that wraps the needed statistical models.
It is able to train the models and to use them to parse a tagged sentence.
Bear in mind that at first we're just going to draw the edges. After we fully under-
stand this mechanism, we'll revisit the code and add the labelling functionality as
well. As promised, this will be an end-to-end solution.
Dependency Dataset
def iter_ud_dep_data(filename):
current_sentence = []
# ignore comments
if line.startswith('#'):
continue
current_sentence = []
continue
annotations = line.split('\t')
This is a simple procedure that reads the Universal Dependency file and returns
a list of quadruplets for each sentence: (word, tag, head, label). head is the index of
the node’s parent. In fact, the simplest representation of a dependency parse is a
parent list: [parent(word) for word in sentence]. Here’s the DependencyParse class we’ll
be using:
DependencyParse class
class DependencyParse(object):
def __init__(self, tagged_sentence):
# pad the sentence with a dummy __START__ and a __ROOT__ node
self._tagged = [('__START__', '__START__')] + tagged_sentence + [('__ROOT__', '__ROOT__')]
def words(self):
return [tt[0] for tt in self._tagged]
def tags(self):
return [tt[1] for tt in self._tagged]
def tagged_words(self):
return self._tagged
def heads(self):
return self._heads
def lefts(self):
return self._left_deps
def rights(self):
return self._right_deps
def deps(self):
return self._deps
def labels(self):
return self._labels
    def __getitem__(self, item):
        # (the exact method name isn't shown in this excerpt; __getitem__ fits the
        #  `item` index used below)
        return {
            'word': self._tagged[item][0],
'pos': self._tagged[item][1],
'head': self._heads[item],
'label': self._labels[item],
'children': list(zip(self._deps[item], self._labels[item])),
'left_children': self._left_deps[item],
'right_children': self._right_deps[item],
}
def __len__(self):
return len(self._tagged)
def __str__(self):
return str(self._heads)
def __repr__(self):
return str(self._heads)
We store some extra information that might seem redundant now (e.g. the dependents
list), but it will prove useful for future work and for feature extraction.
You probably noticed that iter_ud_dep_data doesn't actually yield parses, but lists
of quadruplets. We need parses, because that's what our parser will work with.
Let’s use the iter_ud_dep_data function to read the raw data and build the parses. This
is done in the iter_ud_dependencies function:
iter_ud_dependencies function
def iter_ud_dependencies(filename):
for annotated_sentence in iter_ud_dep_data(filename):
words = []
tags = []
heads = [None]
labels = [None]
for i, (word, pos, head, label) in enumerate(annotated_sentence):
# Skip some fictive nodes that don't belong in the graph
if head == '_':
continue
words.append(word)
tags.append(pos)
if head == '0':
# Point to the dummy node
heads.append(len(annotated_sentence) + 1)
else:
heads.append(int(head))
labels.append(label)
yield dep_parse
We now know what a DependencyParse object looks like and we know how to read it
from the data files. Let's now talk about the ParserState we mentioned before.
Our system works with three types of transitions: SHIFT, LEFT_ARC and RIGHT_ARC. The
ParserState class is a very important component. Here are its responsibilities:
• It knows how to apply a transition to a parse.
• It is able to tell us which transitions are valid from a given state.
• Given the current parse and the gold parse (the annotated parse from the dataset),
it is able to tell us which transitions take us towards the gold parse.
Let’s talk about the last 2 points.
The valid transitions issue is pretty straight-forward. It makes sure we have enough
nodes in the stack/buffer to perform the operation. For example, we can’t apply
SHIFT to a state that has an empty buffer.
The next transitions towards the gold parse part is a bit more subtle. It makes sure
that, out of the valid transitions, we only choose the ones that take us towards the
gold parse. To stay on the “gold path”, we need to make sure that by applying a
transition we don't lose any dependencies. There are, of course, cases where, once we
make a certain transition, we can't go back and add a certain dependency. For example,
if a node is moved via a SHIFT transition, that node can't be a parent of any node
present in the current stack. Knowing the gold parse, we can avoid such situations.
ParseState class
class Transitions:
SHIFT = 0
LEFT_ARC = 1
RIGHT_ARC = 2
class ParserState(object):
# transition == Transitions.LEFT_ARC
else:
# New edge from `buffer_index` to `stack[-1]`
head, child = self.buffer_index, self.stack.pop()
def next_valid(self):
""" The list of transitions that are allowed in the current state """
valid = []
return valid
if not self.stack or (
Transitions.SHIFT in valid and
gold_heads[self.buffer_index] == self.stack[-1]):
return [Transitions.SHIFT]
        # If there's any dependency between the current item in the buffer and
        # a node in the stack, moving the item onto the stack would lose that dependency
return list(valid)
Let's now build our dependency parser. We'll take the same approach as before:
training the model in batches and using the FeatureHasher vectorizer to incrementally
build the feature space. There's one difference: we're going to need a model that can
be trained in batches and that is also capable of giving us prediction probabilities
(i.e. it should implement the predict_proba method).
Scikit-Learn doesn't offer many choices for this type of model. One that matches our
requirements is SGDClassifier (Stochastic Gradient Descent Classifier) using the
modified_huber loss function. We're not going to go into the details of how this
classifier or its loss function works, but I encourage you to look it up in the
sklearn documentation. In case you are wondering why we need the probabilities
for the predicted classes, here's why: we will use the classifier to get the best next
transition. Since the classifier can predict any class, but only some transitions are
valid at a given point, we need to know which is the most probable one out of the
valid options.
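Here's a tiny standalone sketch (toy features, not the real parser code) showing why this particular combination works: SGDClassifier with the modified_huber loss supports both partial_fit and predict_proba:

from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(input_type='dict')
model = SGDClassifier(loss='modified_huber')

X = hasher.transform([{'w-s0': 'buy', 't-s0': 'VB'},
                      {'w-s0': 'apples', 't-s0': 'NNS'}])
# 0 = SHIFT, 1 = LEFT_ARC, 2 = RIGHT_ARC
model.partial_fit(X, [0, 2], classes=[0, 1, 2])

# one probability per transition, per sample -- we'll pick the best *valid* one
print(model.predict_proba(X))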
Feature extraction is also a very important component of our system. In this case,
the feature extraction method will be applied on a ParserState instance. We’ll be
exploring its implementation later on.
Like the chunker and NER we’ve built earlier, the parser also deals with tagged
sentences.
Parser class
class Parser(ParserI):
    @staticmethod
    def build_transition_dataset(parses, feature_extractor):
        """ Transform a list of parses to a dataset """
        transitions_X, transitions_y = [], []
        for gold_parse in parses:
            # Init an empty parse
            dep_parse = DependencyParse(gold_parse.tagged_words()[1:-1])
            # ... step through the sentence, asking the state for its gold moves ...
            if not gold_moves:
                # Something is wrong here ...
                break
            # ... record the (features, transition) pairs and apply the transition ...
        return transitions_X, transitions_y

    def parse(self, tagged_sentence):
        """ Parse a tagged sentence by repeatedly applying the best valid transition """
        # ...
        return state.parse

    def train_batch(self, gold_parses):
        """ Train the model on a single batch """
        t_X, t_Y = self.build_transition_dataset(
            gold_parses, self.feature_extractor)
        self._model.partial_fit(self._vectorizer.transform(t_X), t_Y,
                                classes=Transitions.ALL)

    def train(self, parses, n_iter=5, batch_size=200):
        """ Train the model in batches """
        for _ in range(n_iter):
            parse_stream = iter(parses)
            batch_count = 0
            while True:
                batch_count += 1
                print("Training on batch={0}".format(batch_count))
                batch = list(itertools.islice(parse_stream, batch_size))
                # No more batches
                if not batch:
                    break
                self.train_batch(batch)
In my opinion, after building the ParserState class and understanding how the
transition system works, the parser implementation is pretty simple. The missing
piece is the feature extraction function. The code is pretty ugly and boring;
nevertheless, we must talk about it. The features are obviously more convoluted
than the ones we're used to from the previous models we've built. As we previously
mentioned, the feature extraction function is applied to the current state of the
parser. It is a mix of words and tags, taken from the stack, from the buffer, or from
the dependents (left or right), by themselves or combined into bigrams and trigrams.
Here’s a useful function for padding a list to the right:
pad_right function
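The implementation isn't reproduced in this excerpt; a minimal sketch would be the following (the '__NONE__' padding value is an assumption, and the stack_features/buffer_features helpers used below presumably build on it):

    def pad_right(lst, size, padding='__NONE__'):
        """ Pad `lst` on the right with `padding` until it has `size` elements """
        return lst + [padding] * (size - len(lst))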
dependency_features function
def dependency_features(state):
""" Extract features from a parser state """
length = len(state.parse)
words = state.parse.words()
tags = state.parse.tags()
f = {}
# stack - words/tags
f['w-s0'], f['w-s1'], f['w-s2'] = stack_features(words)
f['t-s0'], f['t-s1'], f['t-s2'] = stack_features(tags)
# buffer - words/tags
f['w-b0'], f['w-b1'], f['w-b2'] = buffer_features(words)
f['t-b0'], f['t-b1'], f['t-b2'] = buffer_features(tags)
    # =============================================
    # distance between the top node in the stack and the current node in the buffer
    top_s = state.stack[-1] if state.stack else 0
    f['dist-b-s'] = state.buffer_index - top_s
# =============================================
# Word/Tag pairs
f['w-s0$t-s0'] = f['w-s0'] + '$' + f['t-s0']
f['w-s1$t-s1'] = f['w-s1'] + '$' + f['t-s1']
f['w-s2$t-s2'] = f['w-s2'] + '$' + f['t-s2']
# =============================================
    # Bigrams
    f['w-s0$w-s1'] = f['w-s0'] + '$' + f['w-s1']
    f['t-s0$t-s1'] = f['t-s0'] + '$' + f['t-s1']
    f['w-s0$w-b0'] = f['w-s0'] + '$' + f['w-b0']
    f['t-s0$t-b0'] = f['t-s0'] + '$' + f['t-b0']
# =============================================
# Trigrams
f['w-s0$w-s1$w-s2'] = f['w-s0$w-s1'] + '$' + f['w-s2']
f['t-s0$t-s1$t-s2'] = f['t-s0$t-s1'] + '$' + f['t-s2']
return f
def main():
    train_data = list(iter_ud_dependencies('../../../data/en-ud-train.conllu'))
    test_data = list(iter_ud_dependencies('../../../data/en-ud-dev.conllu'))
    parser = Parser(dependency_features)
    parser.train(train_data, n_iter=5, batch_size=200)
    # Evaluate on the first 500 sentences of the test set
    print("Accuracy:", parser.evaluate(test_data[:500]))
The reason we're only evaluating on the first 500 sentences of the test set is that
our parser isn't particularly fast. It trains quickly, because we take advantage of
the vectorized operations, but it is very slow when actually parsing. That's because
we need to call the classifier for every single word, and each prediction is needed
before we can move on to the next one. This is not ideal, and there are certainly
ways to speed it up. The purpose of this solution is not to build the most accurate
and efficient parser, but rather to understand the algorithm that drives it.
We’ve got close to 80% accuracy:
Accuracy: 0.799307530605
Here's the complete list of dependency labels and their descriptions: http://universaldependencies.org/u/dep/
Let's create a list with these labels, which we will later use to train our labeler:
List of labels
DEPENDENCY_LABELS = [
"acl",
"acl:relcl",
"advcl",
"advmod",
"amod",
"appos",
"aux",
"aux:pass",
"case",
"cc",
"cc:preconj",
"ccomp",
"compound",
"compound:prt",
"conj",
"cop",
"csubj",
"csubj:pass",
"dep",
"det",
"det:predet",
"discourse",
"dislocated",
"expl",
"fixed",
"flat",
"flat:foreign",
"goeswith",
"iobj",
"list",
"mark",
"nmod",
"nmod:npmod",
"nmod:poss",
"nmod:tmod",
"nsubj",
"nsubj:pass",
"nummod",
"obj",
"obl",
"obl:npmod",
"obl:tmod",
"orphan",
"parataxis",
"punct",
"reparandum",
"root",
"vocative",
"xcomp"
]
We'll reuse most of the code we've written in the previous section and create one
more model that, given an already parsed sentence, adds the labels to the edges.
Our new parser will wrap two models rather than one. First, we need to add a static
method to the Parser class that transforms a list of labeled parses into a dataset.
This dataset will be fed to our labeler model. Note that this feature extractor is
different from the one we used in the previous section: this one is responsible for
extracting the features used for labelling an edge.
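The label feature detector itself isn't reproduced in this excerpt. As a rough idea only, a hypothetical edge-labelling feature detector (here called label_features; it assumes the parse object exposes words() and tags(), as in dependency_features) could look like this:

    def label_features(parse, head, child):
        """ Features describing the edge between `head` and its dependent `child` """
        words, tags = parse.words(), parse.tags()
        return {
            'w-head': words[head], 't-head': tags[head],
            'w-child': words[child], 't-child': tags[child],
            'direction': 'left' if child < head else 'right',
            'distance': abs(head - child),
        }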
build_labels_dataset static method
class Parser(ParserI):
    @staticmethod
    def build_labels_dataset(parses, feature_extractor):
        """ Transform a list of parses to a labels dataset """
        labels_X, labels_y = [], []
        for gold_parse in parses:
            for child, head in enumerate(gold_parse.heads()[1:-1]):
                features = feature_extractor(gold_parse, head, child + 1)
                label = gold_parse.labels()[child + 1]
                labels_X.append(features)
                labels_y.append(label)
        return labels_X, labels_y
In the Parser constructor method, we're going to initialize another model, responsible
for assigning labels:
Parser init
def __init__(self, feature_detector, label_feature_detector):
self.feature_extractor = feature_detector
self.label_feature_detector = label_feature_detector
self._vectorizer = FeatureHasher()
self._model = SGDClassifier(loss='modified_huber')
self._label_vectorizer = FeatureHasher()
self._label_model = Perceptron()
We need to change the train_batch method in order to be able to train the labeler
model while training the transition model.
train labeler in train_batch
def train_batch(self, gold_parses):
    """ Train the model on a single batch """
    t_X, t_Y = self.build_transition_dataset(
        gold_parses, self.feature_extractor)
    l_X, l_Y = self.build_labels_dataset(
        gold_parses, self.label_feature_detector)
    self._model.partial_fit(self._vectorizer.transform(t_X), t_Y,
                            classes=Transitions.ALL)
    self._label_model.partial_fit(self._label_vectorizer.transform(l_X), l_Y,
                                  classes=DEPENDENCY_LABELS)
As planned, we are now going to write a method that, given a parsed sentence, adds
the labels. We'll call it label_parse.
label_parse method
def label_parse(self, parse):
""" Add labels to a dependency parse """
label_features = []
for child, head in enumerate(parse.heads()[1:-1]):
features = self.label_feature_detector(parse, head, child + 1)
label_features.append(features)
vectorized_label_features = self._label_vectorizer.transform(label_features)
predicted_labels = self._label_model.predict(vectorized_label_features)
parse._labels = [None] + list(predicted_labels) + [None]
return parse
We need to make a small change to the parse method, to add the labels just before
returning the computed parse:
        # Label the parse just before returning it
        return self.label_parse(state.parse)
We're almost done. Let's tweak the evaluate method so that it also returns the
accuracy of the labeler. Inside the loop over the gold parses, after computing
predicted_parse, we now compare the gold labels with the ones our labeler predicts,
and at the end we return both ratios:
            heads = np.array(parse.heads()[1:-1])
            predicted_heads = np.array(predicted_parse.heads()[1:-1])
            labels = np.array(parse.labels()[1:-1])
            # Relabel the gold parse with what our model would label
            self.label_parse(parse)
            predicted_labels = np.array(parse.labels()[1:-1])
            total += len(heads)
            correct_heads += np.sum(heads == predicted_heads)
            correct_labels += np.sum(labels == predicted_labels)
        return correct_heads / total, correct_labels / total
def main():
train_data = list(iter_ud_dependencies('../../../data/en-ud-train.conllu'))
test_data = list(iter_ud_dependencies('../../../data/en-ud-dev.conllu'))
• ChatFuel47
• Microsoft LUIS.ai48 - Language Understanding Intelligent Service
• Facebook Wit.ai49 - It can now be used directly from the Messenger Platform50
• Google DialogFlow51 - Previously known as api.ai
• Amazon Lex52
There are plenty of others, but I consider these to be the most popular. The
landscape is also constantly evolving, so be sure to do some research beforehand
if building chatbots is your thing.
In this chapter, we’ll be building a toy conversational NLP engine, similar to
DialogFlow. Obviously the functionality will be only a teeny-tiny fraction of what
DialogFlow has to offer but we’re here to understand the underlying principles.
47 https://chatfuel.com/
48 https://www.luis.ai/
49 https://wit.ai/
50 https://messenger.fb.com/
51 https://dialogflow.com/
52 https://aws.amazon.com/lex/
This is the most important feature. We want to provide a list of examples and the
chatbot will learn to adapt to variations of those examples. Given the example: “I
want to buy tickets for New York”, we decide that the action needed is buy_tickets
with parameters: {"location": "New York"}. The chatbot should figure out if given the
sentence: “I need tickets for Seattle” that the action is buy_tickets with parameters
{"location": "Seattle"}.
Our bot should be able to reply with something like "That's not a nice thing to say"
when foul language is used. We should also be able to respond to greetings and
farewells like "Bye" or "See you later", and other basic stuff like that.
Action Handlers
After our platform detects that we need to buy_tickets, we need to be able to perform
some API calls so that we actually buy the tickets and send them to the user.
I’m going to name this chatbot architecture The Intent/Entities architecture. This
is something I just made up so don’t take it too seriously. It might have another
official name or it may not. It is however essentially the same architecture used by
DialogFlow.
Here’s how this architecture works:
• Every input the chatbot receives is associated with one of the predefined intents
the chatbot knows how to handle. By intent we mean exactly what the dictionary
says it means: "intention or purpose. synonyms: aim, purpose, intention, objective,
object, goal, target, end";
• We want the handlers of the intents to be parameterizable. Going back to the
example bot that sells tickets, we want to be able to tell the bot which event we
want the tickets for.
I’m going to make a simple and practical analogy to URLs and REST verbs. Let’s
consider the following HTTP request:
POST /tickets/buy?event_id=11145&quantity=2
After applying intent detection and entity extraction and stripping away the noise,
we are left with just the intent and its parameters, much like the URL and its query
string above.
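As a purely hypothetical illustration (the field names below are made up for the sake of the analogy, they are not part of the engine we'll build), the pre-processed view of "I want to buy 2 tickets for event 11145" might be represented as:

    {
        "intent": "buy_tickets",                             # ~ the URL + HTTP verb
        "entities": {"event_id": "11145", "quantity": "2"}   # ~ the query parameters
    }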
Next, we’ll implement a Chatbot base class that takes the datasets and trains the
classifiers for identifying intents and entities. In order to make the chatbot answer
user queries with actual information, we need to register handlers for intents.
This will be tackled in the next chapter.
Let's start by building the training set. I've organized it in two parts: some
general intents and some movie-specific intents. Here's the dataset I've written:
Intent/Entities dataset
MOVIE_EXPERT_TRAINING_SET = {
# General Intents
'greetings': [
("Hello", {}),
("Hello!", {}),
("Hi!", {}),
("How are you?", {}),
("Hi There!", {}),
("Hello there!!!", {}),
("Hi!", {}),
("Ello!", {}),
("Hey!", {}),
("Hello mate, how are you?", {}),
("Good morning", {}),
("Good morning!", {}),
("mornin'", {}),
],
'bye': [
("Bye!", {}),
("Bye Bye!", {}),
("ByeBye!", {}),
("Bbye!", {}),
("See you!", {}),
("See you later", {}),
("Good bye", {}),
("See you soon", {}),
("Talk to you later!", {}),
],
'thanks': [
("Thanks dude!", {}),
("Thank you!", {}),
("Thanks", {}),
("10x", {}),
("10q", {}),
("awesome, thanks for your help", {}),
("Thank you so much!", {}),
("Cheers!", {}),
("Cheers, thanks", {}),
("many thanks", {}),
],
'rude': [
("You're an idiot", {}),
("You're such a fucking idiot!", {}),
("You're so stupid !!!", {}),
("You stupid fuck !!!", {}),
("Shut up!", {}),
("Shut the fuck up you fuckhead!", {}),
("You're such a mother fucker!", {}),
("shit!!", {}),
("fuck!!", {}),
("fuck you!!", {}),
("haha, you stupid !", {}),
],
'chit-chat': [
("How are you?", {}),
("Hey, how are you?", {}),
("How's life?", {}),
("How you've been?", {}),
("How's your day?", {}),
("How's life?", {}),
("How's business?", {}),
],
'ask-info': [
("Tell me more about yourself", {}),
("Who are you?", {}),
("Can you provide more info?", {}),
("Can I get some information?", {}),
("Give me more info", {}),
("Give me some info please", {}),
("I need some info", {}),
("I would like to know more please", {}),
("Tell me about you", {}),
("Let's talk about you", {}),
("Who are you?", {}),
    ],
    # Movie-Specific Intents
    # ... (samples annotated with 'movie' and 'actor' entities) ...
}
As you can see, we have several intents and two entity types: movies and actors.
The goal is to create an intent classifier (which is nothing more than a text classifier
like the ones we've built in chapters 2 and 3) and an entity extractor, like the one
we've built in chapter 4.
train_intent_model function
def train_intent_model(train_set):
    tokenizer = TweetTokenizer()
    pipeline = Pipeline(steps=[
        ('vectorizer', CountVectorizer(tokenizer=tokenizer.tokenize)),
('classifier', MultinomialNB()),
])
dataset = []
for intent, samples in train_set.items():
for sample in samples:
dataset.append((sample[0], intent))
random.shuffle(dataset)
intent_X, intent_y = zip(*dataset)
pipeline.fit(intent_X, intent_y)
print("Accuracy on training set:", pipeline.score(intent_X, intent_y))
return pipeline
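A quick smoke test of the intent model (the sample phrases below are made up):

    intent_model = train_intent_model(MOVIE_EXPERT_TRAINING_SET)
    print(intent_model.predict(["Hello there!", "Thanks a lot!"]))
    # If the model learned the training set, this should print something
    # like ['greetings' 'thanks']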
Training the extractors is a bit trickier. We're going to train one extractor for each
entity type. First, we need to create a dataset for every entity type: we extract the
sentences that contain references to each entity type we want to train an extractor for.
The next thing we need to build is a mapping between intents and extractors. We
don't want to apply all the extractors all the time; we only want to apply an extractor
to the intents that contain samples with such entities.
Two small helpers are needed here: sub_list, which finds the position of an entity's
words inside a sentence, and a function (here called sentence_to_tree) that marks
those words with IOB tags and returns the result as an nltk.Tree:
sub_list and sentence_to_tree functions
def sub_list(lst, slst):
    """ Return the start index of `slst` inside `lst`, or None if it's absent """
    start_index = 0
    while len(lst) >= len(slst):
        if lst[:len(slst)] == slst:
            return start_index
        lst = lst[1:]
        start_index += 1
    return None

def sentence_to_tree(tagged_sentence, entity_words, label):
    """ Mark `entity_words` inside a POS-tagged sentence using IOB tags """
    iob_tagged = [(word, tag, 'O') for word, tag in tagged_sentence]
    words = nltk.untag(tagged_sentence)
    start_index = sub_list(words, entity_words)
    if start_index is not None:
        iob_tagged[start_index] = (
            iob_tagged[start_index][0],
            iob_tagged[start_index][1],
            'B-' + label
        )
        for idx in range(1, len(entity_words)):
            iob_tagged[start_index + idx] = (
                iob_tagged[start_index + idx][0],
                iob_tagged[start_index + idx][1],
                'I-' + label
            )
    return nltk.conlltags2tree(iob_tagged)
def build_extractors_datasets(train_set):
    """
    Transform the training set into per-entity-type datasets of nltk.Tree objects:
    {entity_type: [tree1, tree2, ...]}
    """
    tokenizer = TweetTokenizer()
    datasets = defaultdict(list)
    for _, samples in train_set.items():
        for sample in samples:
            words = tokenizer.tokenize(sample[0])
            tagged = nltk.pos_tag(words)
            # Mark every annotated entity using the sentence_to_tree helper above
            for entity_type, entity_value in sample[1].items():
                entity_words = tokenizer.tokenize(entity_value)
                datasets[entity_type].append(
                    sentence_to_tree(tagged, entity_words, entity_type))
    return datasets
def build_intent_extractor_mapping(train_set):
mapping = {}
for intent in train_set:
mapping[intent] = set({})
for sample in train_set[intent]:
mapping[intent].update(list(sample[1].keys()))
return mapping
def train_extractor(trees):
def train_scikit_classifier(sentences, feature_detector):
dataset = feature_detector(sentences)
featuresets, labels = zip(*dataset)
scikit_classifier = ScikitClassifier()
scikit_classifier.train(featuresets, labels)
return scikit_classifier
    classifier_chunker = ClassifierBasedChunkParser(
        trees,
        chunk_features,
        train_scikit_classifier,
    )
    return classifier_chunker
extractor_datasets = build_extractors_datasets(MOVIE_EXPERT_TRAINING_SET)
movie_extractor = train_extractor(extractor_datasets['movie'])
actor_extractor = train_extractor(extractor_datasets['actor'])
Everything together
class Chatbot(object):
    def __init__(self, min_intent_confidence=.2):
        self.min_intent_confidence = min_intent_confidence
        self.intent_model = None
        self.entity_types = []
        self.extractors = {}
        self.intent_extractor_mapping = None
        self.handlers = {}

    def train(self, training_set):
        """ Train the intent classifier and one extractor per entity type """
        self.intent_model = train_intent_model(training_set)
        extractor_datasets = build_extractors_datasets(training_set)
        for entity_type, trees in extractor_datasets.items():
            self.extractors[entity_type] = train_extractor(trees)
        self.intent_extractor_mapping = build_intent_extractor_mapping(training_set)

    def get_intent(self, text):
        """ Return the most confident intent for the given text """
        # ... pick best_score_index from the intent model's probabilities ...
        return self.intent_model.classes_[best_score_index]

    def get_entities(self, text, intent):
        """ Run only the extractors mapped to the detected intent """
        entities = {}
        # ... for each extractor mapped to the intent, collect the matching subtrees ...
        if extracted_entities:
            first_entity = extracted_entities[0]
            entities[extractor_name] = ' '.join(t[0] for t in first_entity)
        return entities

    def register_handler(self, intent):
        """ Decorator for registering a handler for the given intent """
        def wrapper(handler):
            self.handlers[intent] = handler
        return wrapper

    def register_default_handler(self):
        """ Decorator for registering the default (fallback) handler """
        def wrapper(handler):
            self.handlers[None] = handler
        return wrapper
In order to get the movie information we need, we'll use The Movie DB54 API. To
be able to use the API, you need to create an account and request API access. After
you set up API access, go check out the documentation of the API here:
https://developers.themoviedb.org/355 . We won't use the API directly, but through
a library. Go through the documentation a bit to familiarize yourself with the
information that is available and how to access it. In order to use the API, get your
v3 auth API key from your API Settings Page56 :
54 https://www.themoviedb.org/
55 https://developers.themoviedb.org/3
56 https://www.themoviedb.org/settings/api
Let’s now install a simple library that handles the API access for TMDB:
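The snippets that follow access the API through a module imported as tmdb. Assuming the library in question is the tmdbsimple package (which matches the calls used in this chapter), the installation and setup would look roughly like this:

$ pip install tmdbsimple

    import tmdbsimple as tmdb

    # Use the v3 auth key from your TMDB API Settings page
    tmdb.API_KEY = 'YOUR-TMDB-API-KEY'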
Small-Talk Handlers
Before jumping into building handlers that provide movie information from TMDB,
let's warm up with some small-talk ones:
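The decorators below assume a trained bot instance is already in scope. With the Chatbot class sketched above (and assuming its training method is called train), a minimal setup might be:

    bot = Chatbot()
    bot.train(MOVIE_EXPERT_TRAINING_SET)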
@bot.register_default_handler()
def default_response(_):
return "Sorry, I can't help you out with that ..."
@bot.register_handler('greetings')
def say_hello(_):
return random.choice([
"Hi!",
"Hi there!",
"Hello yourself!"
])
@bot.register_handler('bye')
def say_bye(_):
return random.choice([
"Goodbye to you!",
"Nice meeting you, bye",
"Have a good day!",
"Goodbye"
])
@bot.register_handler('thanks')
def say_thanks(_):
return random.choice([
"No, thank you!",
"It's what I do ...",
"At your service ...",
"Come back anytime"
])
@bot.register_handler('chit-chat')
def make_conversation(_):
return random.choice([
"I'm fine, thanks for asking ...",
"The weather is fine, business is fine, I'm pretty well myself ...",
"I'm bored",
"All good, thanks",
])
@bot.register_handler('rude')
def be_rude(_):
return random.choice([
"That's pretty rude of you. Come back with some manners ...",
"You're rude, leave me alone.",
"That's not very nice ...",
"You're not a nice person, are you?",
])
@bot.register_handler('ask-info')
def provide_info(_):
return random.choice([
"I know a looot of movies, try me!",
"I can help you with movies and stuff ...",
Simple Handlers
Notice how we've added some diversity by selecting a random line out of a number
of choices. Also notice that these handlers don't require any parameters. This is
not the case for most of the execution handlers that perform real operations. Let's
start with the simplest one: the handler associated with the get-latest intent. It
doesn't require any entities or parameters, so it's the perfect opportunity to start
using the TMDB API:
@bot.register_handler('get-latest')
def get_latest(_):
try:
latest_movies = tmdb.Movies().now_playing()['results']
random.shuffle(latest_movies)
return "Here are some new movies I think you're going to like: %s" % (
", ".join(['"' + m['original_title'] + '"' for m in latest_movies[:5]]))
except KeyError:
return "I'm not as up to date as I used to be. Sorry."
Execution Handlers
Let's now write the remaining execution handlers. The code is pretty straightforward,
as most of them search for a movie and extract some particular piece of information.
Execution handlers
@bot.register_handler('what-director')
def what_director(entities):
if 'movie' not in entities:
return "Sorry, I didn't quite catch that."
search = tmdb.Search()
search.movie(query=entities['movie'])
try:
movie = tmdb.Movies(search.results[0]['id'])
except (KeyError, IndexError):
return "That's weird ... I never heard of this movie."
    try:
        director = [p for p in movie.credits()['crew'] if p['job'] == 'Director'][0]
        return "The director of %s is %s" % (entities['movie'], director['name'])
    except (KeyError, IndexError):
        return "That's weird ... I don't know who directed this one."
@bot.register_handler('what-release-year')
def what_release_year(entities):
if 'movie' not in entities:
return "Sorry, what movie is that again?"
search = tmdb.Search()
search.movie(query=entities['movie'])
try:
movie = search.results[0]
except (KeyError, IndexError):
return "Is that a real movie or did you just make it up?"
try:
return "The release date of %s is %s" % (movie['title'], movie['release_date'])
except (KeyError, IndexError):
return "That's weird ... I don't know the release date."
@bot.register_handler('get-similar-movies')
def get_similar_movies(entities):
if 'movie' not in entities:
return "Sorry, I didn't quite catch that."
search = tmdb.Search()
search.movie(query=entities['movie'])
try:
movie = tmdb.Movies(search.results[0]['id'])
except (KeyError, IndexError):
return "That's weird ... I never heard of this movie."
try:
recommendations = movie.similar_movies()['results']
random.shuffle(recommendations)
return "Here are a few ones I think you're going to enjoy: %s" % (
", ".join(['"' + m['original_title'] + '"' for m in recommendations[:5]]))
except (KeyError, IndexError):
return "That's weird ... I don't know this one."
@bot.register_handler('actor-movies')
def get_actor_movies(entities):
try:
actor = entities['actor']
except KeyError:
return "Sorry, what actor is that?"
search = tmdb.Search()
search.person(query=actor)
try:
actor = tmdb.People(search.results[0]['id'])
except (IndexError, KeyError):
return "That's weird ... I never heard of him/her."
movies = actor.movie_credits()['cast']
if not movies:
return "I don't know of any movies featuring him/her."
    random.shuffle(movies)
    return "Here are a few movies with %s: %s" % (
        entities['actor'],
        ", ".join(['"' + m['original_title'] + '"' for m in movies[:5]]))
@bot.register_handler('movie-cast')
def movie_cast(entities):
if 'movie' not in entities:
return "Sorry, I didn't quite catch that."
search = tmdb.Search()
search.movie(query=entities['movie'])
try:
movie = tmdb.Movies(search.results[0]['id'])
except (IndexError, KeyError):
return "That's weird ... I never heard of this movie."
cast = movie.credits()['cast']
if not cast:
return "I don't know of any actors playing in this movie ..."
@bot.register_handler('actor-in-movie')
def actor_in_movie(entities):
try:
actor, movie = entities['actor'], entities['movie']
except KeyError:
return "Did who play in what?"
search = tmdb.Search()
search.movie(query=movie)
try:
movie = tmdb.Movies(search.results[0]['id'])
except (IndexError, KeyError):
return "That's weird ... I never heard of this movie."
search = tmdb.Search()
search.person(query=actor)
try:
actor = tmdb.People(search.results[0]['id'])
except (IndexError, KeyError):
return "That's weird ... I never heard of this actor."
cast = movie.credits()['cast']
role = [a for a in cast if a['id'] == actor.info()['id']]
    if not role:
        return "Not as far as I know ..."
    return "Yes! %s plays %s in this movie." % (role[0]['name'], role[0]['character'])
The loop
while True:
line = input("You: ")
print("MovieBot: ", bot.tell(line))
Chatting
Go ahead and have some fun with your trained bot. Find some examples where it
fails and try to edit the training set to fix those scenarios.
Installing ngrok
Head over to ngrok.com57 to download and install this utility. Here’s how to create
a tunnel on port 9898:
57 https://ngrok.com/
Launch ngrok
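Assuming the standard ngrok command-line interface, the tunnel is created with:

$ ngrok http 9898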
Notice how ngrok created two URLs that are accessible from the outside, one for
HTTP and one for HTTPS. Your URLs will be different. Try them out and see what
happens. You'll probably get a message saying that there was no client at the end of
the ngrok tunnel. Providing that client is the job of our chatbot.
In order to make our chatbot work with the Facebook Messenger platform, we need
to adhere to a few of its conventions.
Let’s start by putting sensitive information inside a .env file like this:
FB_CHALLENGE='YOUR-CHALLENGE-PHRASE'
FB_PAGE_ACCESS_TOKEN='YOUR-ACCESS-TOKEN'
You don’t have an access token yet but this is the place where you are going to put
it. You can come up with whatever challenge phrase you want. Make sure you load
the environment variables by running:
$ source .env
A Facebook chatbot is basically a web application with a single endpoint that
Facebook will be calling. If the request method is GET, then Facebook is performing
a webhook verification. If the request method is POST, then Facebook is sending an
event, usually an incoming message. Let's write a simple Flask application for this:
import os
import json
import requests
from flask import Flask, request
from moviebot import bot
app = Flask(__name__)
MOVIE_BOT_CHALLENGE = os.environ.get('FB_CHALLENGE')
FB_PAGE_ACCESS_TOKEN = os.environ.get('FB_PAGE_ACCESS_TOKEN')
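The webhook endpoint itself isn't reproduced in this excerpt; here's a minimal sketch of its GET (verification) branch. The '/webhook' route path is an assumption, while the hub.* parameters are part of Facebook's webhook verification protocol:

    @app.route('/webhook', methods=['GET', 'POST'])
    def webhook():
        # Webhook verification: echo the challenge back if the token matches
        if request.method == 'GET':
            if request.args.get('hub.verify_token') == MOVIE_BOT_CHALLENGE:
                return request.args.get('hub.challenge')
            return "Wrong verification token", 403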
In order to run the Flask application, append this line to the file:
app.run('0.0.0.0', 9898)
Setting up Facebook
To connect the chatbot to a page, you need a Facebook Page, either newly created
or an existing one (this is where the chatbot will be displayed), and a Facebook
Application (this will be the actual chatbot).
In case you've never created a Facebook Page before, you can do so here: Create
Facebook Page58 . I created an Entertainment Facebook page named "The Movie
Chatbot" and chose the category "TV Show" because nothing else really fit.
58 https://www.facebook.com/pages/create/
Head over to the Messenger menu item inside your Facebook Application. We now
need to connect the application to the Facebook Page:
Notice how Facebook generated a Page Access Token for us. Make sure you copy
and paste it into the .env file and reload it:
$ source .env
Verify Webhook
Subscribe to Page
Trying it Out
If everything went smoothly up to this point, when you visit your page and hit the
Send Message button you will notice that a messenger box appears. You can write
stuff to it, but the chatbot won’t answer. Let’s write the code to handle incoming
messages:
# Incoming message
elif request.method == 'POST':
event = request.get_json()
if event['object'] == 'page':
for entry in event['entry']:
try:
message_event = entry['messaging'][0]
message_handler(message_event['sender']['id'],
message_event['message'])
except IndexError:
pass
return "ok"
app.run('0.0.0.0', 9898)
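The POST branch above calls a message_handler function that isn't shown in this excerpt. A minimal sketch, assuming Facebook's Send API (the Graph API version in the URL is just an example) and the bot.tell method used earlier, could look like this:

    def message_handler(sender_id, message):
        """ Reply to an incoming Facebook message using the MovieBot """
        reply = bot.tell(message.get('text', ''))
        requests.post(
            "https://graph.facebook.com/v2.6/me/messages",
            params={"access_token": FB_PAGE_ACCESS_TOKEN},
            json={"recipient": {"id": sender_id}, "message": {"text": reply}},
        )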
What Next?
You now know a lot about Natural Language Processing tools and, best of all, you
know how to build them from scratch, just in case what's already out there doesn't
fit your needs. There's a lot of room for improvement, and this is not an exhaustive
resource on all things related to Natural Language Processing. The main purpose
of the book is to provide a solid basis for getting started with the subject. Please
make sure to follow my blog: http://nlpforhackers.io60 for new content. The plan is
to write content there and, when I consider I've gathered enough feedback, either
update this book or start another one.
What content would you like to see in the book? Send me an email at: me@bogs.io
or contact me via the form here: https://nlpforhackers.io/contact/61
Thank you,
George-Bogdan Ivanov.
60 http://nlpforhackers.io
61 https://nlpforhackers.io/contact/