Python With TextBlob


Introduction

Spelling mistakes are common, and most people are used to software
indicating if a mistake was made. From autocorrect on our phones, to red
underlining in text editors, spell checking is an essential feature for many
different products.

The first program to implement spell checking was written in 1971 for
the DEC PDP-10. Called SPELL, it was capable of performing only simple
comparisons of words and detecting one or two letter differences. As
hardware and software advanced, so have spell checkers. Modern spell
checkers are capable of handling morphology and using statistics to
improve suggestions.

Python offers many modules for this purpose, so writing a simple spell
checker takes about 20 minutes.

One of these libraries is TextBlob, a natural language processing library
with an intuitive API.

In this article, we'll take a look at how to implement spelling correction in
Python with TextBlob.

Installation

First, we'll need to install TextBlob, since it doesn't come preinstalled. Open
up a console and install it using pip:

$ pip install textblob

This should install everything we need for this project. Upon finishing the
installation, the console output should include something like:

Successfully installed click-7.1.2 joblib-0.17.0 nltk-3.5 regex-2020.11.13 textblob-0.15.3

TextBlob is built on top of NLTK, so NLTK is installed along with it.

The correct() Function

The most straightforward way to correct input text is to use
the correct() method. The example text we'll be using is a paragraph from
Charles Darwin's "On the Origin of Species", which is in the public
domain, saved in a file called text.txt.

In addition, we'll add some deliberate spelling mistakes:

As far as I am abl to judg, after long attnding to the sbject, the condiions of

lfe apear to act in two ways—directly on the whle organsaton or on certin parts

alne and indirectly by afcting the reproducte sstem. Wit respct to te dirct

action, we mst bea in mid tht in every cse, as Profesor Weismann hs latly

insistd, and as I have inidently shwn in my wrk on "Variatin undr

Domesticcation," thcere arae two factrs: namly, the natre of the orgnism and

the natture of the condiions. The frmer sems to be much th mre importannt; foor

nealy siimilar variations sometimes aris under, as far as we cn juddge,

disimilar conditios; annd, on te oter hannd, disssimilar variatioons arise

undder conditions which aappear to be nnearly uniiform. The efffects on tthe

offspring arre ieither definnite or in definite. They maay be considdered as

definnite whhen allc or neearly all thhe ofefspring off inadividuals exnposed

tco ceertain conditionas duriing seveal ggenerations aree moodified in te saame

maner.

It's full of spelling mistakes, in almost every word. Let's write a simple
script, using TextBlob, to correct these mistakes and print the result to the
console:

from textblob import TextBlob

with open("text.txt", "r") as f:  # Open the test file for reading
    text = f.read()  # Read the file contents

textBlb = TextBlob(text)  # Create our first TextBlob

textCorrected = textBlb.correct()  # Correct the text

print(textCorrected)

If you've worked with TextBlob before, this flow will look familiar to you.
We've read the contents of the file and constructed a TextBlob instance by
passing those contents to the constructor. Then, we call the correct() method
on that instance to perform spelling correction.

After running the script above, you should get an output similar to:

Is far as I am all to judge, after long attending to the subject, the

conditions of life appear to act in two ways—directly on the while organisation

or on certain parts alone and indirectly by acting the reproduce system. It

respect to te direct action, we must be in mid the in every case, as Professor

Weismann he lately insisted, and as I have evidently shown in my work on

"Variation under Domesticcation," there are two facts: namely, the nature of

the organism and the nature of the conditions. The former seems to be much th

are important; for nearly similar variations sometimes arms under, as far as we

in judge, similar condition; and, on te other hand, disssimilar variations

arise under conditions which appear to be nearly uniform. The effects on the

offspring are either definite or in definite. They may be considered as

definite when all or nearly all the offspring off individuals exposed to

certain conditions during several generations are modified in te same manner.

How Correct is TextBlob's Spelling Correction?

As we can see, the text still has some spelling errors. For example, "abl"
was supposed to become "able", not "all". Even so, the result is still better
than the original.

This raises the question: how much better is it?

The following script measures how well TextBlob corrects errors, based on
this example:

from textblob import TextBlob

# A function that compares two texts word by word (by position) and
# returns the number of matches and differences.
# It assumes both texts contain the same number of words.
def compare(text1, text2):
    l1 = text1.split()
    l2 = text2.split()
    good = 0
    bad = 0
    for w1, w2 in zip(l1, l2):
        if w1 != w2:
            bad += 1
        else:
            good += 1
    return (good, bad)

# Helper function to calculate the percentage of mismatched words
# from a (good, bad) tuple
def percentageOfBad(x):
    return (x[1] / (x[0] + x[1])) * 100
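
To get a feel for what these helpers return, here is a small self-contained sketch that runs them on two made-up sentences. The sentences are illustrative assumptions, not taken from the book, and the helpers are restated in condensed form so the snippet runs on its own:

```python
def compare(text1, text2):
    """Count position-by-position word matches and mismatches."""
    pairs = list(zip(text1.split(), text2.split()))
    bad = sum(1 for a, b in pairs if a != b)
    return (len(pairs) - bad, bad)


def percentageOfBad(x):
    """Share of mismatched words, as a percentage."""
    return (x[1] / (x[0] + x[1])) * 100


# Hypothetical sample sentences, made up for this illustration
typed = "the quik brown fox jumps ovr the lazy dog"
intended = "the quick brown fox jumps over the lazy dog"

result = compare(typed, intended)
print(result)                   # (7, 2): seven words match, two differ
print(percentageOfBad(result))  # 2 of 9 words, roughly 22.2
```

Because the comparison is positional, it is only meaningful when both texts have the same number of words in the same order.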

Now, with those two functions, let's run a quick analysis:

with open("test.txt", "r") as f1:  # test.txt contains the same typo-filled text from the last example
    t1 = f1.read()

with open("original.txt", "r") as f2:  # original.txt contains the text from the actual book
    t2 = f2.read()

t3 = TextBlob(t1).correct()

mistakesCompOriginal = compare(t1, t2)
originalCompCorrected = compare(t2, t3)
mistakesCompCorrected = compare(t1, t3)

print("Mistakes compared to original ", mistakesCompOriginal)
print("Original compared to corrected ", originalCompCorrected)
print("Mistakes compared to corrected ", mistakesCompCorrected, "\n")

print("Percentage of mistakes in the test: ",
      percentageOfBad(mistakesCompOriginal), "%")
print("Percentage of mistakes in the corrected: ",
      percentageOfBad(originalCompCorrected), "%")
print("Percentage of fixed mistakes: ",
      percentageOfBad(mistakesCompCorrected), "%", "\n")

Running it will print out:

Mistakes compared to original (126, 194)
Original compared to corrected (269, 51)
Mistakes compared to corrected (145, 175)

Percentage of mistakes in the test: 60.62499999999999 %
Percentage of mistakes in the corrected: 15.937499999999998 %
Percentage of fixed mistakes: 54.6875 %

As we can see, the correct() method brought our spelling mistake
percentage down from 60.6% to 15.9%, which is pretty decent. However, there's
a bit of a catch. It changed 54.7% of the words, so why is there still a 15.9%
mistake rate?


The answer is overcorrection. Sometimes, it changes a word that is
spelled correctly, like the first word in our example text, where "As" was
changed to "Is". Other times, it just doesn't have enough information
about the word and its context to tell which word the user intended to
type, so it guesses that it should replace "whle" with "while" instead
of "whole".
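
One way to see overcorrection in the numbers is to look at all three versions of each word at once. The sketch below is an assumption about how the analysis could be extended, not part of the script above: the `classify` helper and its category names are hypothetical, and it assumes the three texts have the same number of words. The sample strings are short fragments stitched together from the example text and its corrected output.

```python
from collections import Counter


def classify(original, typed, corrected):
    """Tally what happened at each word position (hypothetical helper)."""
    counts = Counter()
    triples = zip(original.split(), typed.split(), corrected.split())
    for orig, typo, fixed in triples:
        if typo == orig:
            # The word was typed correctly to begin with
            counts["kept" if fixed == orig else "overcorrected"] += 1
        elif fixed == orig:
            counts["fixed"] += 1         # typo repaired correctly
        elif fixed == typo:
            counts["missed"] += 1        # typo left exactly as typed
        else:
            counts["miscorrected"] += 1  # typo changed to the wrong word
    return counts


# Fragments from the example: intended words, typed words, corrected words
original = "As far as I am able to judge on the whole"
typed = "As far as I am abl to judg on the whle"
corrected = "Is far as I am all to judge on the while"

counts = classify(original, typed, corrected)
print(counts["overcorrected"])  # 1: "As" became "Is"
print(counts["miscorrected"])   # 2: "abl" -> "all", "whle" -> "while"
print(counts["fixed"])          # 1: "judg" -> "judge"
```

Splitting the tally this way separates errors the corrector introduced from errors it merely failed to repair, which a single mismatch percentage cannot do.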

There is no perfect spelling corrector, because so much of language is
contextual, so keep that in mind. In most use cases, there are far fewer
mistakes than in our example, so TextBlob should work well enough for the
average user.
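
TextBlob's documentation states that its spelling correction is based on Peter Norvig's "How to Write a Spelling Corrector". The sketch below is a much-simplified illustration of that general idea, not TextBlob's actual code: the tiny `WORD_COUNTS` table and its counts are made up, and `edits1`, `candidates` and `best_guess` are hypothetical helper names. It suggests why a corrector that relies on word frequency alone cannot reliably choose between "while" and "whole" for the input "whle".

```python
import string

# Made-up word frequencies; a real corrector would learn these from a large corpus
WORD_COUNTS = {"while": 120, "whole": 80, "the": 1000, "organisation": 5}


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def candidates(word):
    """Known words at edit distance 0 or 1, falling back to the word itself."""
    if word in WORD_COUNTS:
        return {word}
    known = {w for w in edits1(word) if w in WORD_COUNTS}
    return known or {word}


def best_guess(word):
    """Pick the most frequent candidate; surrounding context is ignored."""
    return max(candidates(word), key=lambda w: WORD_COUNTS.get(w, 0))


print(sorted(candidates("whle")))  # ['while', 'whole']
print(best_guess("whle"))          # while (only because its made-up count is higher)
```

Both candidates are a single inserted letter away from "whle", so without looking at the neighbouring words, the choice comes down to which word is more common in the frequency table.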
