Text Categorization and Classification
Introduction
This might sound very abstract, but there are plenty of situations nowadays where companies
need automatic classification or categorization of documents. Just think of a large company
with thousands of incoming mail pieces per day, both electronic and paper-based. Many of
these mail pieces arrive without a specific addressee name or department. Somebody has to
read these texts and decide what kind of letter it is ("change of address", "complaints
letter", "inquiry about products", and so on) and to whom the document should be
forwarded. This "somebody" can be an automated text classification system.
Automated text classification, also called text categorization, has a history dating back
to the beginning of the 1960s. But the incredible increase in available online documents in
the last two decades, due to the expanding internet, has intensified and renewed the
interest in automated document classification and data mining. Early text classification
focused on heuristic methods, i.e. solving the task by applying a set of rules based on
expert knowledge. This approach proved to be highly inefficient, so nowadays the focus has
turned to fully automatic learning and clustering methods.
The task of text classification consists in assigning a document to one or more categories,
based on the semantic content of the document. Document (or text) classification runs in
two modes: the training phase, in which the classifier learns from pre-labelled documents,
and the prediction (or classification) phase, in which unknown documents are assigned to
categories.
We will implement a text classifier in Python using Naive Bayes. Naive Bayes is the most
commonly used text classifier and a constant focus of research in text classification. A
Naive Bayes classifier is based on the application of Bayes' theorem with strong
independence assumptions. "Strong independence" means: the presence or absence of a
particular feature of a class is unrelated to the presence or absence of any other feature.
Naive Bayes is well suited for multiclass text classification.
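To make the independence assumption concrete, the following small sketch scores a document
for two classes by multiplying the class prior with the probabilities of its words, as if
every word occurred independently of the others. All numbers and class names are invented
toy values, not real estimates:

# Toy Naive Bayes scoring; every number here is made up for illustration.
priors = {"complaints": 0.4, "inquiries": 0.6}
word_probs = {
    "complaints": {"refund": 0.05, "broken": 0.04, "product": 0.02},
    "inquiries":  {"refund": 0.01, "broken": 0.005, "product": 0.05},
}

def score(words, cls):
    """ prior times the product of the word probabilities """
    p = priors[cls]
    for w in words:
        p *= word_probs[cls][w]
    return p

doc = ["refund", "broken", "product"]
for cls in priors:
    print(cls, score(doc, cls))
# complaints ≈ 1.6e-05, inquiries ≈ 1.5e-06  -> "complaints" wins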
Formal Definition
Let C = {c1, c2, ... cm} be a set of categories (classes) and D = {d1, d2, ... dn} a set of
documents.

The task of text classification consists in assigning to each pair (ci, dj) of C x D (with
1 ≤ i ≤ m and 1 ≤ j ≤ n) a value of 0 or 1: the value 1, if the document dj belongs to the
category ci, and the value 0 otherwise. These assignments form the following matrix:
        d1    ...   dj    ...   dn
c1      a11   ...   a1j   ...   a1n
...     ...   ...   ...   ...   ...
ci      ai1   ...   aij   ...   ain
...     ...   ...   ...   ...   ...
cm      am1   ...   amj   ...   amn
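As a concrete (made-up) instance of such an assignment matrix, two categories and three
documents could be represented in Python as nested dictionaries, where a value of 1 marks
membership:

# Hypothetical assignment matrix for m = 2 categories and n = 3 documents:
assignment = {
    "complaints": {"doc1": 1, "doc2": 0, "doc3": 1},
    "inquiries":  {"doc1": 0, "doc2": 1, "doc3": 0},
}
print(assignment["complaints"]["doc1"])   # 1: doc1 belongs to "complaints"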
Well-known methods for text classification include:

- Naive Bayes
- Support Vector Machine
- Nearest Neighbour
Naive Bayes Classifier
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with
strong (naïve) independence assumptions, i.e. an "independent feature model".
In other words: A naive Bayes classifier assumes that the presence (or absence) of a
particular feature of a class is unrelated to the presence (or absence) of any other feature.
The number of times a word wt occurs within a document di will be denoted as Nit. Nt^C
denotes the number of times a word wt occurs in all documents of a given class C.

The probability P(wt|cj) that the word wt occurs in a document of class cj is estimated
with Laplace (add-one) smoothing, so that words unseen in the training documents of a class
do not get a probability of zero:

P(wt|cj) = (1 + Nt^cj) / (|V| + Σs Ns^cj)

where |V| is the size of the vocabulary.

P(di|cj) is 1, if di is labelled as cj, and 0 otherwise.
The probability for a class cj is the quotient of the number of documents of cj and the
number of documents of all classes, i.e. the learn set:

P(cj) = (number of documents labelled cj) / (number of documents of the learn set)
Finally, we come to the formula we need to classify an unknown document, i.e. the
probability for a class cj given a document di:

P(cj|di) = P(cj) ∏t P(wt|cj)^Nit / Σk ( P(ck) ∏t P(wt|ck)^Nit )
Unfortunately, the formula for P(cj|di) we have just given is numerically not stable: the
products of many small word probabilities underflow, so both the numerator and the
denominator can become zero due to rounding errors. We change this by calculating the
reciprocal and reformulating the expression as a sum of stable quotients:

1 / P(cj|di) = Σk ( P(ck) ∏t P(wt|ck)^Nit ) / ( P(cj) ∏t P(wt|cj)^Nit )

We can rewrite the previous formula into the following form, our final Naive Bayes
classification formula, the one we will use in our Python implementation in the following
section:

1 / P(cj|di) = Σk ( P(ck) / P(cj) ) ∏t ( P(wt|ck) / P(wt|cj) )^Nit

Each quotient P(wt|ck) / P(wt|cj) is of moderate size, so the products no longer underflow.
We have transformed the standard formula for P(c|di), as it is used in many treatises, into
a numerically stable form.
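A small experiment makes the stability argument tangible; all probabilities below are
arbitrary toy values. Multiplying a few hundred small word probabilities underflows to 0.0
in floating point, so the direct formula breaks down, while the quotient form only
multiplies ratios of similar magnitude:

# Toy demonstration of the underflow problem; the numbers are invented.
p_w_c1 = [0.001] * 400    # P(wt|c1) for 400 word occurrences
p_w_c2 = [0.002] * 400    # P(wt|c2)

direct = 1.0
for p in p_w_c1:
    direct *= p
print(direct)             # 0.0 -- underflow, the direct formula fails

# Stable form (equal priors assumed): 1/P(c1|di) is the sum over all
# classes k of the product of the quotients P(wt|ck) / P(wt|c1).
recip = 1.0               # the k = c1 term: every quotient is 1
prod = 1.0
for p1, p2 in zip(p_w_c1, p_w_c2):
    prod *= p2 / p1       # quotients of moderate size (here: 2.0)
recip += prod
print(1 / recip)          # small, but non-zero and meaningful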
Implementation in Python

We use a Naive Bayes classifier for our implementation in Python. The formal introduction
to the Naive Bayes approach can be found in the previous section.

Python is ideal for text classification because of its powerful string methods and because
its regular expression module re makes it easy to split raw text into words. The only
downside might be that this Python implementation is not tuned for efficiency.
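As a small illustration (the sample sentence is made up), one line of re is enough to turn
a text into lowercase word tokens; this is essentially what the read_document method below
will do:

import re

text = "Please change my address. The old address is WRONG!"
words = [w for w in re.split(r"\W+", text.lower()) if w]
print(words)
# ['please', 'change', 'my', 'address', 'the', 'old', 'address', 'is', 'wrong']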
Document Representation

The document representation is based on the bag-of-words model: a document is reduced to
the words it contains together with their frequencies, while grammar and word order are
disregarded.
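For example, the sample sentence from above would be represented by the following
dictionary, mapping each word to its frequency; note that the two occurrences of "address"
collapse into one entry with count 2:

# Bag of words of the made-up sentence
# "Please change my address. The old address is wrong.":
bag = {"please": 1, "change": 1, "my": 1, "address": 2,
       "the": 1, "old": 1, "is": 1, "wrong": 1}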
Imports Needed
Our implementation needs the regular expression module re and the os module:
import re, os
BagOfWords Class
class BagOfWords(object):
    """ Implementing a bag of words: words corresponding with their
        frequency of usage in a "document", for usage by the Document
        class, DocumentClass class and the Pool class. """

    def __init__(self):
        self.__number_of_words = 0
        self.__bag_of_words = {}

    def __add__(self, other):
        """ Overloading of the "+" operator to join two BagOfWords """
        erg = BagOfWords()
        joined = erg.__bag_of_words
        for key in self.__bag_of_words:
            joined[key] = self.__bag_of_words[key]
            if key in other.__bag_of_words:
                joined[key] += other.__bag_of_words[key]
        for key in other.__bag_of_words:
            if key not in joined:
                joined[key] = other.__bag_of_words[key]
        # keep the total word count of the joined bag consistent:
        erg.__number_of_words = self.__number_of_words + other.__number_of_words
        return erg

    def add_word(self, word):
        """ A word is added in the dictionary __bag_of_words """
        self.__number_of_words += 1
        if word in self.__bag_of_words:
            self.__bag_of_words[word] += 1
        else:
            self.__bag_of_words[word] = 1

    def len(self):
        """ Returning the number of different words of an object """
        return len(self.__bag_of_words)

    def Words(self):
        """ Returning a list of the words contained in the object """
        return self.__bag_of_words.keys()

    def BagOfWords(self):
        """ Returning the dictionary, containing the words (keys) with
            their frequency (values) """
        return self.__bag_of_words

    def WordFreq(self, word):
        """ Returning the frequency of a word """
        if word in self.__bag_of_words:
            return self.__bag_of_words[word]
        else:
            return 0
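A quick sketch of how the class is used; the words are arbitrary examples:

# Minimal usage example of BagOfWords with made-up words:
bag1 = BagOfWords()
for w in ["complaint", "address", "complaint"]:
    bag1.add_word(w)

bag2 = BagOfWords()
bag2.add_word("inquiry")
bag2.add_word("complaint")

joined = bag1 + bag2
print(joined.BagOfWords())        # {'complaint': 3, 'address': 1, 'inquiry': 1}
print(joined.WordFreq("inquiry")) # 1
print(joined.len())               # 3 different words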
Document Class

class Document(object):
    """ Used both for learning (training) documents and for testing
        documents. The optional parameter learn has to be set to True
        if a classifier should be trained; for a test document it has
        to be set to False. """
    _vocabulary = BagOfWords()

    def __init__(self, vocabulary):
        self.__name = ""
        self._words_and_freq = BagOfWords()
        Document._vocabulary = vocabulary

    def read_document(self, filename, learn=False):
        """ A document is read. It is assumed that the document is
            either encoded in utf-8 or in latin-1. The words of the
            document are stored in a BagOfWords, i.e.
            self._words_and_freq """
        try:
            text = open(filename, "r", encoding="utf-8").read()
        except UnicodeDecodeError:
            text = open(filename, "r", encoding="latin-1").read()
        text = text.lower()
        # split the text into words; drop the empty strings produced
        # at the boundaries of the split:
        words = [w for w in re.split(r"\W+", text) if w]

        self._number_of_words = 0
        for word in words:
            self._words_and_freq.add_word(word)
            if learn:
                Document._vocabulary.add_word(word)

    def __add__(self, other):
        """ Overloading the "+" operator. Adding two documents consists
            in adding the BagOfWords of the documents """
        res = Document(Document._vocabulary)
        res._words_and_freq = self._words_and_freq + other._words_and_freq
        return res

    def vocabulary_length(self):
        """ Returning the length of the vocabulary """
        return Document._vocabulary.len()

    def WordsAndFreq(self):
        """ Returning the dictionary, containing the words (keys) with
            their frequency (values) as contained in the BagOfWords
            attribute of the document """
        return self._words_and_freq.BagOfWords()

    def Words(self):
        """ Returning the words of the Document object """
        d = self._words_and_freq.BagOfWords()
        return d.keys()

    def WordFreq(self, word):
        """ Returning the number of times the word "word" appeared in
            the document """
        bow = self._words_and_freq.BagOfWords()
        if word in bow:
            return bow[word]
        else:
            return 0
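A short sketch of how a Document gets filled; the file name and its content are made up
for illustration:

# Write a tiny example file and read it as a Document:
voc = BagOfWords()
with open("example.txt", "w", encoding="utf-8") as f:
    f.write("Please change my address. The old address is wrong.")

d = Document(voc)
d.read_document("example.txt", learn=True)
print(d.WordFreq("address"))   # 2
print(d.vocabulary_length())   # number of different words seen so far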
DocumentClass Class

class DocumentClass(Document):
    """ A DocumentClass object contains all the words of the training
        documents of one category, built by adding up those documents """

    def __init__(self, vocabulary):
        Document.__init__(self, vocabulary)
        self._number_of_docs = 0

    def Probability(self, word):
        """ Returns the probability of the word "word" given the class
            "self", using the Laplace-smoothed estimate
            P(wt|cj) = (1 + Nt^cj) / (|V| + Σs Ns^cj) """
        voc_len = Document._vocabulary.len()
        # SumN: total number of word occurrences within this class
        SumN = sum(self._words_and_freq.BagOfWords().values())
        N = self._words_and_freq.WordFreq(word)
        return (1 + N) / (voc_len + SumN)

    def __add__(self, other):
        """ Overloading the "+" operator. Adding two DocumentClass
            objects consists in adding the BagOfWords of the
            DocumentClass objects """
        res = DocumentClass(self._vocabulary)
        res._words_and_freq = self._words_and_freq + other._words_and_freq
        return res

    def SetNumberOfDocs(self, number):
        self._number_of_docs = number

    def NumberOfDocuments(self):
        return self._number_of_docs
Pool Class

class Pool(object):
    """ Administrates the document classes and the common vocabulary and
        implements the numerically stable classification formula """

    def __init__(self):
        self.__document_classes = {}
        self.__vocabulary = BagOfWords()

    def sum_words_in_class(self, dclass):
        """ The total number of occurrences of the vocabulary words
            within the class dclass """
        total = 0
        WaF = self.__document_classes[dclass].WordsAndFreq()
        for word in self.__vocabulary.Words():
            if word in WaF:
                total += WaF[word]
        return total

    def learn(self, directory, dclass_name):
        """ directory is a path where the training files of the class
            with the name dclass_name can be found """
        x = DocumentClass(self.__vocabulary)
        files = os.listdir(directory)
        for file in files:
            d = Document(self.__vocabulary)
            d.read_document(directory + "/" + file, learn=True)
            x = x + d
        self.__document_classes[dclass_name] = x
        x.SetNumberOfDocs(len(files))

    def Probability(self, doc, dclass=""):
        """ Calculates the probability for a class dclass given a
            document doc; called without dclass, it returns the sorted
            list of the probabilities of all classes """
        if dclass:
            sum_dclass = self.sum_words_in_class(dclass)
            prob = 0

            d = Document(self.__vocabulary)
            d.read_document(doc)

            for j in self.__document_classes:
                sum_j = self.sum_words_in_class(j)
                prod = 1
                for i in d.Words():
                    wf_dclass = 1 + self.__document_classes[dclass].WordFreq(i)
                    wf = 1 + self.__document_classes[j].WordFreq(i)
                    r = wf * sum_dclass / (wf_dclass * sum_j)
                    prod *= r
                prob += prod * self.__document_classes[j].NumberOfDocuments() \
                             / self.__document_classes[dclass].NumberOfDocuments()
            if prob != 0:
                return 1 / prob
            else:
                return -1
        else:
            prob_list = []
            for dclass in self.__document_classes:
                prob = self.Probability(doc, dclass)
                prob_list.append([dclass, prob])
            prob_list.sort(key=lambda x: x[1], reverse=True)
            return prob_list
base = "learn/"
p = Pool()
for i in DClasses:
p.learn(base + i, i)
base = "test/"
for i in DClasses:
dir = os.listdir(base + i)
for file in dir:
res = p.Probability(base + i + "/" + file)
print(i + ": " + file + ": " + str(res))