Text Categorization and Classification
Introduction
This might sound very abstract, but there are plenty of situations nowadays where companies
need automatic classification or categorization of documents. Just think of a large company
with thousands of incoming mail pieces per day, both electronic and paper-based. Many of
these mail pieces arrive without a specific addressee name or department. Somebody has to
read these texts and decide what kind of letter it is ("change of address", "complaints
letter", "inquiry about products", and so on) and to whom the document should be
forwarded. This "somebody" can be an automated text classification system.
Automated text classification, also called text categorization, has a history dating back
to the beginning of the 1960s. But the incredible increase in available online documents in
the last two decades, due to the expanding internet, has intensified and renewed the
interest in automated document classification and data mining. Early text classification
focused on heuristic methods, i.e. solving the task by applying a set of rules based on
expert knowledge. This approach proved to be highly inefficient, so nowadays the focus has
turned to fully automatic learning and clustering methods.
The task of text classification consists in assigning a document to one or more categories,
based on the semantic content of the document. Document (or text) classification runs in
two modes: the training phase, in which the classifier learns from pre-labelled documents,
and the prediction (or classification) phase, in which unknown documents are assigned to
categories.
We will implement a text classifier in Python using Naive Bayes. Naive Bayes is the most
commonly used text classifier and a constant focus of research in text classification. A
Naive Bayes classifier is based on the application of Bayes' theorem with strong
independence assumptions. "Strong independence" means: the presence or absence of a
particular feature of a class is unrelated to the presence or absence of any other feature.
Naive Bayes is well suited for multiclass text classification.
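To make the independence assumption concrete, the following small sketch scores a document
for two classes by multiplying the class prior with the probabilities of its words, as if
every word occurred independently of the others. All numbers and class names are invented
toy values, not real estimates:

# Toy Naive Bayes scoring; every number here is made up for illustration.
priors = {"complaints": 0.4, "inquiries": 0.6}
word_probs = {
    "complaints": {"refund": 0.05, "broken": 0.04, "product": 0.02},
    "inquiries":  {"refund": 0.01, "broken": 0.005, "product": 0.05},
}

def score(words, cls):
    """ prior times the product of the word probabilities """
    p = priors[cls]
    for w in words:
        p *= word_probs[cls][w]
    return p

doc = ["refund", "broken", "product"]
for cls in priors:
    print(cls, score(doc, cls))
# complaints ≈ 1.6e-05, inquiries ≈ 1.5e-06  -> "complaints" wins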
Formal Definition
Let C = {c1, c2, ... cm} be a set of categories (classes) and D = {d1, d2, ... dn} a set of
documents.

The task of text classification consists in assigning to each pair (ci, dj) of C x D (with
1 ≤ i ≤ m and 1 ≤ j ≤ n) a value of 0 or 1: the value 1, if the document dj belongs to the
category ci, and the value 0 otherwise. These assignments form the following matrix:
        d1    ...   dj    ...   dn
c1      a11   ...   a1j   ...   a1n
...     ...   ...   ...   ...   ...
ci      ai1   ...   aij   ...   ain
...     ...   ...   ...   ...   ...
cm      am1   ...   amj   ...   amn
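As a concrete (made-up) instance of such an assignment matrix, two categories and three
documents could be represented in Python as nested dictionaries, where a value of 1 marks
membership:

# Hypothetical assignment matrix for m = 2 categories and n = 3 documents:
assignment = {
    "complaints": {"doc1": 1, "doc2": 0, "doc3": 1},
    "inquiries":  {"doc1": 0, "doc2": 1, "doc3": 0},
}
print(assignment["complaints"]["doc1"])   # 1: doc1 belongs to "complaints"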
Well-known methods for text classification include:

- Naive Bayes
- Support Vector Machine
- Nearest Neighbour
Naive Bayes Classifier
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with
strong (naïve) independence assumptions, i.e. an "independent feature model".
In other words: A naive Bayes classifier assumes that the presence (or absence) of a
particular feature of a class is unrelated to the presence (or absence) of any other feature.
The number of times a word wt occurs within a document di will be denoted as Nit. Nt^C
denotes the number of times a word wt occurs in all documents of a given class C.

The probability P(wt|cj) that the word wt occurs in a document of class cj is estimated
with Laplace (add-one) smoothing, so that words unseen in the training documents of a class
do not get a probability of zero:

P(wt|cj) = (1 + Nt^cj) / (|V| + Σs Ns^cj)

where |V| is the size of the vocabulary.

P(di|cj) is 1, if di is labelled as cj, and 0 otherwise.
The probability for a class cj is the quotient of the number of documents of cj and the
number of documents of all classes, i.e. the learn set:

P(cj) = (number of documents labelled cj) / (number of documents of the learn set)
Finally, we come to the formula we need to classify an unknown document, i.e. the
probability for a class cj given a document di:

P(cj|di) = P(cj) ∏t P(wt|cj)^Nit / Σk ( P(ck) ∏t P(wt|ck)^Nit )
Unfortunately, the formula for P(cj|di) we have just given is numerically not stable: the
products of many small word probabilities underflow, so both the numerator and the
denominator can become zero due to rounding errors. We change this by calculating the
reciprocal and reformulating the expression as a sum of stable quotients:

1 / P(cj|di) = Σk ( P(ck) ∏t P(wt|ck)^Nit ) / ( P(cj) ∏t P(wt|cj)^Nit )

We can rewrite the previous formula into the following form, our final Naive Bayes
classification formula, the one we will use in our Python implementation in the following
section:

1 / P(cj|di) = Σk ( P(ck) / P(cj) ) ∏t ( P(wt|ck) / P(wt|cj) )^Nit

Each quotient P(wt|ck) / P(wt|cj) is of moderate size, so the products no longer underflow.
We have transformed the standard formula for P(c|di), as it is used in many treatises, into
a numerically stable form.
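A small experiment makes the stability argument tangible; all probabilities below are
arbitrary toy values. Multiplying a few hundred small word probabilities underflows to 0.0
in floating point, so the direct formula breaks down, while the quotient form only
multiplies ratios of similar magnitude:

# Toy demonstration of the underflow problem; the numbers are invented.
p_w_c1 = [0.001] * 400    # P(wt|c1) for 400 word occurrences
p_w_c2 = [0.002] * 400    # P(wt|c2)

direct = 1.0
for p in p_w_c1:
    direct *= p
print(direct)             # 0.0 -- underflow, the direct formula fails

# Stable form (equal priors assumed): 1/P(c1|di) is the sum over all
# classes k of the product of the quotients P(wt|ck) / P(wt|c1).
recip = 1.0               # the k = c1 term: every quotient is 1
prod = 1.0
for p1, p2 in zip(p_w_c1, p_w_c2):
    prod *= p2 / p1       # quotients of moderate size (here: 2.0)
recip += prod
print(1 / recip)          # small, but non-zero and meaningful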
Implementation in Python

We use a Naive Bayes classifier for our implementation in Python. The formal introduction
to the Naive Bayes approach can be found in the previous section.

Python is ideal for text classification because of its powerful string methods and because
its regular expression module re makes it easy to split raw text into words. The only
downside might be that this Python implementation is not tuned for efficiency.
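As a small illustration (the sample sentence is made up), one line of re is enough to turn
a text into lowercase word tokens; this is essentially what the read_document method below
will do:

import re

text = "Please change my address. The old address is WRONG!"
words = [w for w in re.split(r"\W+", text.lower()) if w]
print(words)
# ['please', 'change', 'my', 'address', 'the', 'old', 'address', 'is', 'wrong']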
Document Representation

The document representation is based on the bag-of-words model: a document is reduced to
the words it contains together with their frequencies, while grammar and word order are
disregarded.
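For example, the sample sentence from above would be represented by the following
dictionary, mapping each word to its frequency; note that the two occurrences of "address"
collapse into one entry with count 2:

# Bag of words of the made-up sentence
# "Please change my address. The old address is wrong.":
bag = {"please": 1, "change": 1, "my": 1, "address": 2,
       "the": 1, "old": 1, "is": 1, "wrong": 1}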
Imports Needed
Our implementation needs the regular expression module re and the os module:
import re, os
BagOfWords Class
class BagOfWords(object):
    """ Implementing a bag of words: words corresponding with their
        frequency of usage in a "document", for usage by the Document
        class, DocumentClass class and the Pool class. """

    def __init__(self):
        self.__number_of_words = 0
        self.__bag_of_words = {}

    def __add__(self, other):
        """ Overloading of the "+" operator to join two BagOfWords """
        erg = BagOfWords()
        joined = erg.__bag_of_words
        for key in self.__bag_of_words:
            joined[key] = self.__bag_of_words[key]
            if key in other.__bag_of_words:
                joined[key] += other.__bag_of_words[key]
        for key in other.__bag_of_words:
            if key not in joined:
                joined[key] = other.__bag_of_words[key]
        # keep the total word count of the joined bag consistent:
        erg.__number_of_words = self.__number_of_words + other.__number_of_words
        return erg

    def add_word(self, word):
        """ A word is added in the dictionary __bag_of_words """
        self.__number_of_words += 1
        if word in self.__bag_of_words:
            self.__bag_of_words[word] += 1
        else:
            self.__bag_of_words[word] = 1

    def len(self):
        """ Returning the number of different words of an object """
        return len(self.__bag_of_words)

    def Words(self):
        """ Returning a list of the words contained in the object """
        return self.__bag_of_words.keys()

    def BagOfWords(self):
        """ Returning the dictionary, containing the words (keys) with
            their frequency (values) """
        return self.__bag_of_words

    def WordFreq(self, word):
        """ Returning the frequency of a word """
        if word in self.__bag_of_words:
            return self.__bag_of_words[word]
        else:
            return 0
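A quick sketch of how the class is used; the words are arbitrary examples:

# Minimal usage example of BagOfWords with made-up words:
bag1 = BagOfWords()
for w in ["complaint", "address", "complaint"]:
    bag1.add_word(w)

bag2 = BagOfWords()
bag2.add_word("inquiry")
bag2.add_word("complaint")

joined = bag1 + bag2
print(joined.BagOfWords())        # {'complaint': 3, 'address': 1, 'inquiry': 1}
print(joined.WordFreq("inquiry")) # 1
print(joined.len())               # 3 different words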
Document Class

class Document(object):
    """ Used both for learning (training) documents and for testing
        documents. The optional parameter learn has to be set to True
        if a classifier should be trained; for a test document it has
        to be set to False. """
    _vocabulary = BagOfWords()

    def __init__(self, vocabulary):
        self.__name = ""
        self._words_and_freq = BagOfWords()
        Document._vocabulary = vocabulary

    def read_document(self, filename, learn=False):
        """ A document is read. It is assumed that the document is
            either encoded in utf-8 or in latin-1. The words of the
            document are stored in a BagOfWords, i.e.
            self._words_and_freq """
        try:
            text = open(filename, "r", encoding="utf-8").read()
        except UnicodeDecodeError:
            text = open(filename, "r", encoding="latin-1").read()
        text = text.lower()
        # split the text into words; drop the empty strings produced
        # at the boundaries of the split:
        words = [w for w in re.split(r"\W+", text) if w]

        self._number_of_words = 0
        for word in words:
            self._words_and_freq.add_word(word)
            if learn:
                Document._vocabulary.add_word(word)

    def __add__(self, other):
        """ Overloading the "+" operator. Adding two documents consists
            in adding the BagOfWords of the documents """
        res = Document(Document._vocabulary)
        res._words_and_freq = self._words_and_freq + other._words_and_freq
        return res

    def vocabulary_length(self):
        """ Returning the length of the vocabulary """
        return Document._vocabulary.len()

    def WordsAndFreq(self):
        """ Returning the dictionary, containing the words (keys) with
            their frequency (values) as contained in the BagOfWords
            attribute of the document """
        return self._words_and_freq.BagOfWords()

    def Words(self):
        """ Returning the words of the Document object """
        d = self._words_and_freq.BagOfWords()
        return d.keys()

    def WordFreq(self, word):
        """ Returning the number of times the word "word" appeared in
            the document """
        bow = self._words_and_freq.BagOfWords()
        if word in bow:
            return bow[word]
        else:
            return 0
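A short sketch of how a Document gets filled; the file name and its content are made up
for illustration:

# Write a tiny example file and read it as a Document:
voc = BagOfWords()
with open("example.txt", "w", encoding="utf-8") as f:
    f.write("Please change my address. The old address is wrong.")

d = Document(voc)
d.read_document("example.txt", learn=True)
print(d.WordFreq("address"))   # 2
print(d.vocabulary_length())   # number of different words seen so far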
DocumentClass Class

class DocumentClass(Document):
    """ A DocumentClass object contains all the words of the training
        documents of one category, built by adding up those documents """

    def __init__(self, vocabulary):
        Document.__init__(self, vocabulary)
        self._number_of_docs = 0

    def Probability(self, word):
        """ Returns the probability of the word "word" given the class
            "self", using the Laplace-smoothed estimate
            P(wt|cj) = (1 + Nt^cj) / (|V| + Σs Ns^cj) """
        voc_len = Document._vocabulary.len()
        # SumN: total number of word occurrences within this class
        SumN = sum(self._words_and_freq.BagOfWords().values())
        N = self._words_and_freq.WordFreq(word)
        return (1 + N) / (voc_len + SumN)

    def __add__(self, other):
        """ Overloading the "+" operator. Adding two DocumentClass
            objects consists in adding the BagOfWords of the
            DocumentClass objects """
        res = DocumentClass(self._vocabulary)
        res._words_and_freq = self._words_and_freq + other._words_and_freq
        return res

    def SetNumberOfDocs(self, number):
        self._number_of_docs = number

    def NumberOfDocuments(self):
        return self._number_of_docs
Pool Class

class Pool(object):
    """ Administrates the document classes and the common vocabulary and
        implements the numerically stable classification formula """

    def __init__(self):
        self.__document_classes = {}
        self.__vocabulary = BagOfWords()

    def sum_words_in_class(self, dclass):
        """ The total number of occurrences of the vocabulary words
            within the class dclass """
        total = 0
        WaF = self.__document_classes[dclass].WordsAndFreq()
        for word in self.__vocabulary.Words():
            if word in WaF:
                total += WaF[word]
        return total

    def learn(self, directory, dclass_name):
        """ directory is a path where the training files of the class
            with the name dclass_name can be found """
        x = DocumentClass(self.__vocabulary)
        files = os.listdir(directory)
        for file in files:
            d = Document(self.__vocabulary)
            d.read_document(directory + "/" + file, learn=True)
            x = x + d
        self.__document_classes[dclass_name] = x
        x.SetNumberOfDocs(len(files))

    def Probability(self, doc, dclass=""):
        """ Calculates the probability for a class dclass given a
            document doc; called without dclass, it returns the sorted
            list of the probabilities of all classes """
        if dclass:
            sum_dclass = self.sum_words_in_class(dclass)
            prob = 0

            d = Document(self.__vocabulary)
            d.read_document(doc)

            for j in self.__document_classes:
                sum_j = self.sum_words_in_class(j)
                prod = 1
                for i in d.Words():
                    wf_dclass = 1 + self.__document_classes[dclass].WordFreq(i)
                    wf = 1 + self.__document_classes[j].WordFreq(i)
                    r = wf * sum_dclass / (wf_dclass * sum_j)
                    prod *= r
                prob += prod * self.__document_classes[j].NumberOfDocuments() \
                             / self.__document_classes[dclass].NumberOfDocuments()
            if prob != 0:
                return 1 / prob
            else:
                return -1
        else:
            prob_list = []
            for dclass in self.__document_classes:
                prob = self.Probability(doc, dclass)
                prob_list.append([dclass, prob])
            prob_list.sort(key=lambda x: x[1], reverse=True)
            return prob_list
base = "learn/"
p = Pool()
for i in DClasses:
p.learn(base + i, i)
base = "test/"
for i in DClasses:
dir = os.listdir(base + i)
for file in dir:
res = p.Probability(base + i + "/" + file)
print(i + ": " + file + ": " + str(res))