A Comparative Approach To Email Classification Using Naive Bayes Classifier and Hidden Markov Model

Abstract— This research investigates a comparison between two different approaches to classifying emails by category. Naive Bayes and the Hidden Markov Model (HMM), two different machine learning algorithms, have both been used to detect whether an email is important or spam. The Naive Bayes Classifier is based on conditional probabilities; it is fast, works well with small datasets, and treats independent words as features. The HMM is a generative, probabilistic model that provides a distribution over sequences of observations. HMMs can handle inputs of variable length and help programs reach the most likely decision based on both previous decisions and current data. Various combinations of NLP techniques (stopword removal, stemming, lemmatizing) have been tried with both algorithms to inspect the differences in accuracy and to find the best method among them.

Index Terms— Email Classification, Hidden Markov Model, Naive Bayes, Natural Language Processing, NLTK, Supervised Learning.

I. INTRODUCTION

Email is one of the most important means of communication in today's world, and its usage has increased substantially around the globe. In 2015, the number of emails sent and received per day totaled over 205 billion. This figure was expected to grow at an average annual rate of 3% over the following four years, reaching over 246 billion by the end of 2019¹. As of December 2016, spam messages accounted for 61.66 percent of email traffic worldwide². Filtering these spam emails has therefore become a pressing need for email users around the globe. This paper describes methodologies that can be used to classify emails into different categories such as important and spam. Relative words and sentences have been considered as features for classifying email messages. The difference in nature between the Naive Bayes Classifier and the Hidden Markov Model makes them interesting to compare. A dataset was collected and pre-processed before evaluating accuracy, precision, recall, and F-measure for both algorithms. Stemming, lemmatizing, and stopword removal were applied in various combinations with the algorithms to analyze which algorithm, under which combination, gives the best result.

¹ http://www.radicati.com/wp/wp-content/uploads/2015/02/Email-Statistics-Report-2015-2019-Executive-Summary.pdf
² https://www.statista.com/statistics/420391/spam-email-traffic-share/

II. RELATED WORK

This paper focuses on distinguishing important emails from spam emails. One major factor in the categorization is how to represent the messages: specifically, one must decide which features to use and how to apply those features to the categorization. M. Aery et al. [1] gave an approach based on the premise that patterns can be extracted from a pre-classified email folder and used effectively to classify incoming emails. As emails follow a format of headers and a body, the correlation between different terms can be shown in the form of a graph, so they chose graph mining as a viable technique for pattern extraction and classification. R. Islam et al. [2] proposed a multi-stage classification technique using popular learning algorithms such as SVM, Naive Bayes, and boosting, together with an analyzer that substantially reduces false precision and increases classification accuracy compared to similar existing techniques. B. Klimt et al. [3] introduced the Enron corpus as a new dataset for this domain. V. Bhat et al. [4] derived a spam filter called BeaKS, which classifies emails into spam and non-spam; their pre-processing technique is designed to identify tag-of-spam words relevant to the dataset. X. Wang et al. [5] reviewed recent approaches to filtering out spam email, categorizing email into a hierarchy of folders, and automatically determining the tasks required in response to an email. According to E. Yitagesu et al. [6], in sender-based detection the email sender's information, such as writing style and user name, is used as the major feature. S. Teli et al. [7] presented a three-phase system for spam detection. In the first phase, the user creates the rules for classification; rules are simply the keywords/phrases that occur in legitimate or spam mails. The second phase is the training phase, in which the classifier is trained manually by the user on spam and legitimate emails, and keywords are then extracted from the classified emails by the algorithm. Once the first and second phases are complete, classification begins: using this knowledge of tokens, the filter classifies every new incoming email. Here the probability of maximum ...
... could be labeled as both important and spam in the early stage, this ratio helps to determine the ultimate label of any such word: only if the ratio is 1 is a word considered to be both important and spam. Thus word:label pairs were created, where the most informative features (the words that appeared most frequently in the email dataset) were labeled as important or spam. The spam words were extracted and stored in a list called spamwords; similarly, the important words were stored in a list of their own.

S = {S1, S2, S3, ..., St} is a sequence of states; we do not observe S. Y = {Y1, Y2, Y3, ..., Yt} is a sequence of emissions; we observe Y. Y2 is conditionally independent of everything else given S2, and S4 is conditionally independent of everything else given S3. The probability of being in a particular state at step i is known once we know the state at step i-1, and the probability of seeing a particular emission at step i is known once we know the state at step i. The joint distribution of a sequence of states and observations can therefore be factored in the following way [14]:

P(S, Y) = P(S1) P(Y1|S1) ∏_{i=2}^{t} P(Si|Si-1) P(Yi|Si)
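As a concrete illustration of this factorization (not code from the paper), the sketch below evaluates P(S, Y) for a toy two-state model; the start, transition, and emission tables are invented for the example, not values from our trained models:

```python
# Illustrative HMM joint probability:
# P(S, Y) = P(S1) P(Y1|S1) * prod over i of P(Si|Si-1) P(Yi|Si).
# The probability tables below are made-up examples, not trained values.

start = {"important": 0.6, "spam": 0.4}                    # P(S1)
trans = {"important": {"important": 0.7, "spam": 0.3},     # P(Si | Si-1)
         "spam":      {"important": 0.2, "spam": 0.8}}
emit = {"important": {"meeting": 0.5, "offer": 0.1, "free": 0.4},  # P(Yi | Si)
        "spam":      {"meeting": 0.1, "offer": 0.4, "free": 0.5}}

def joint_probability(states, emissions):
    """P(S, Y) factored as P(S1)P(Y1|S1) times the product of P(Si|Si-1)P(Yi|Si)."""
    p = start[states[0]] * emit[states[0]][emissions[0]]
    for i in range(1, len(states)):
        p *= trans[states[i - 1]][states[i]] * emit[states[i]][emissions[i]]
    return p

print(joint_probability(["spam", "spam"], ["free", "offer"]))  # 0.4*0.5*0.8*0.4 = 0.064
```

For classification, the same quantity can be evaluated under a model trained on important mail and one trained on spam mail, with the higher likelihood deciding the label.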
... determine the email category. Otherwise, the email was classified as a spam email. The HMM algorithm was also run with the same eight combinations of NLP techniques as the Naive Bayes algorithm. This time a different result was obtained: HMM with stemming outperformed all seven other variants. After that, the best result of the Naive Bayes algorithm was compared with the best result of HMM. Overall, HMM was more successful at identifying spam and important emails in the test dataset.
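The eight variants are the 2³ on/off combinations of the three preprocessing steps. A minimal sketch of such a pipeline with NLTK follows; the specific tokenizer, stemmer, and lemmatizer choices here are our assumptions, since the paper does not name them:

```python
from itertools import product

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Assumes the NLTK data packages (punkt, stopwords, wordnet) are downloaded.
STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text, remove_stopwords=False, stem=False, lemmatize=False):
    """Tokenize an email body and apply any combination of the three steps."""
    tokens = [w.lower() for w in word_tokenize(text) if w.isalpha()]
    if remove_stopwords:
        tokens = [w for w in tokens if w not in STOPWORDS]
    if lemmatize:
        tokens = [lemmatizer.lemmatize(w) for w in tokens]
    if stem:
        tokens = [stemmer.stem(w) for w in tokens]
    return tokens

# The 2^3 = 8 combinations ("basic" = all flags off) used in the experiments.
for flags in product([False, True], repeat=3):
    variant = dict(zip(["remove_stopwords", "stem", "lemmatize"], flags))
    print(variant, preprocess("Claim your free offers now!", **variant))
```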
Process | Trained Data | Test Data | Correctly Classified (Important / Spam) | Incorrectly Classified (Important / Spam) | Indeterminate (Important / Spam) | Accuracy (%)
Stop Words | 5500 | 200 | 83 / 57 | 13 / 25 | 4 / 18 | 78.65
Stemming | 5500 | 200 | 85 / 52 | 7 / 40 | 8 / 8 | 74.46
Lemmatizing | 5500 | 200 | 72 / 58 | 16 / 29 | 12 / 13 | 74.29
Stop Words + Stemming | 5500 | 200 | 74 / 58 | 14 / 30 | 12 / 12 | 75.00
Stop Words + Lemmatizing | 5500 | 200 | 70 / 67 | 16 / 20 | 14 / 13 | 79.19
Lemmatizing + Stemming | 5500 | 200 | 85 / 52 | 6 / 41 | 9 / 7 | 74.46
Stop Words + Lemmatizing + Stemming | 5500 | 200 | 72 / 59 | 15 / 29 | 13 / 12 | 74.86

Table 5.1: Evaluation on Test Set (Naive Bayes in Different Processes)

Then again, using only stop words removal, out of 100 important emails 83 instances were classified correctly, 13 instances were classified incorrectly, and 4 instances could not be determined.

Fig 5.1: Accuracy Comparisons (Naive Bayes)

Figure 5.1 shows the accuracy comparison among the 8 different combinational approaches to Naive Bayes; using stop words removal and lemmatizing together gives the best accuracy, 79.19%.

Process | Accuracy | Precision (Positive / Negative) | Recall (Positive / Negative) | F-Measure (Positive / Negative)
Basic | 78.28 | 0.82 / 0.75 | 0.75 / 0.82 | 0.78 / 0.78
Stop Words | 78.65 | 0.77 / 0.81 | 0.86 / 0.70 | 0.81 / 0.75
Stop Words + Lemmatizing + Stemming | 74.86 | 0.71 / 0.80 | 0.83 / 0.67 | 0.77 / 0.73

Table 5.2: Precision, Recall and F-Measure of Test Set (Naive Bayes in Different Processes)
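The word:label training and the most-informative-features extraction described earlier follow the standard NLTK pattern. A minimal sketch with toy messages standing in for the Enron data (the feature function and example emails below are ours, for illustration only):

```python
import nltk

# Toy stand-ins for the labeled Enron emails used in the experiments.
train = [("meeting agenda attached please review", "important"),
         ("free offer claim your prize now", "spam"),
         ("project report due friday", "important"),
         ("cheap pills free shipping offer", "spam")]

def features(text):
    # Bag-of-words features: each word is treated as an independent feature.
    return {word: True for word in text.split()}

classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train])

# The most informative features give the word:label pairs described above;
# words most predictive of "spam" can be collected into a spamwords list.
classifier.show_most_informative_features(5)
print(classifier.classify(features("claim your free offer")))  # spam
```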
Stop words removal gives the highest positive F-measure, 0.81, and lemmatizing gives the lowest positive F-measure, 0.76. We also see that lemmatizing and stemming together give the highest negative precision, 0.90, while basic Naive Bayes gives the lowest negative precision, 0.75. Basic Naive Bayes gives the highest negative recall, 0.82, and lemmatizing and stemming together give the lowest negative recall, 0.56. Lastly, stop words removal and lemmatizing together give the highest negative F-measure, 0.79, and both stemming alone and lemmatizing and stemming together give the lowest negative F-measure, 0.69.

Process | Trained Data | Test Data | Correctly Classified (Important / Spam) | Incorrectly Classified (Important / Spam) | Indeterminate (Important / Spam) | Accuracy (%)
Basic | 5500 | 200 | 80 / 81 | 17 / 11 | 3 / 8 | 85.19
Stop Words | 5500 | 200 | 87 / 71 | 11 / 19 | 2 / 10 | 84.04
Stemming | 5500 | 200 | 87 / 91 | 10 / 7 | 3 / 2 | 91.28
Lemmatizing | 5500 | 200 | 83 / 77 | 15 / 17 | 2 / 6 | 83.33
Stop Words + Stemming | 5500 | 200 | 91 / 74 | 7 / 22 | 2 / 4 | 85.05
Stop Words + Lemmatizing | 5500 | 200 | 86 / 68 | 13 / 23 | 1 / 9 | 81.05
Lemmatizing + Stemming | 5500 | 200 | 86 / 89 | 10 / 7 | 4 / 4 | 91.15
Stop Words + Lemmatizing + Stemming | 5500 | 200 | 91 / 72 | 7 / 23 | 2 / 5 | 84.46

Table 5.3: Evaluation on Test Set (Hidden Markov Model in Different Processes)

Figure 5.2 shows the accuracy comparison among the 8 different combinational approaches to the Hidden Markov Model; using only stemming gives the best accuracy, 91.28%.

Process | Accuracy | Precision (Positive / Negative) | Recall (Positive / Negative) | F-Measure (Positive / Negative)
Basic | 85.19 | 0.88 / 0.83 | 0.82 / 0.88 | 0.85 / 0.85
Stop Words | 84.04 | 0.82 / 0.87 | 0.89 / 0.79 | 0.85 / 0.83
Stemming | 91.28 | 0.93 / 0.90 | 0.90 / 0.93 | 0.91 / 0.91
Lemmatizing | 83.33 | 0.83 / 0.84 | 0.85 / 0.82 | 0.84 / 0.83
Stop Words + Stemming | 85.05 | 0.81 / 0.91 | 0.93 / 0.77 | 0.86 / 0.84
Stop Words + Lemmatizing | 81.05 | 0.79 / 0.84 | 0.87 / 0.75 | 0.83 / 0.79
Lemmatizing + Stemming | 91.15 | 0.92 / 0.90 | 0.90 / 0.93 | 0.91 / 0.91
Stop Words + Lemmatizing + Stemming | 84.46 | 0.80 / 0.91 | 0.93 / 0.76 | 0.86 / 0.83

Table 5.4: Precision, Recall and F-Measure of Test Set (HMM in Different Processes)
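The Table 5.4 values follow from the raw counts in Table 5.3, treating important as the positive class and excluding indeterminate instances from the denominators (our reading of how the reported accuracies were computed, which the numbers bear out). A short sketch of the computation:

```python
def metrics(imp_correct, spam_correct, imp_wrong, spam_wrong):
    """Compute the Table 5.3/5.4 metrics from raw counts.

    Positive class = important. Indeterminate instances are excluded,
    which matches the reported accuracies.
    """
    tp, tn, fn, fp = imp_correct, spam_correct, imp_wrong, spam_wrong
    accuracy = 100 * (tp + tn) / (tp + tn + fn + fp)
    prec_pos, rec_pos = tp / (tp + fp), tp / (tp + fn)
    prec_neg, rec_neg = tn / (tn + fn), tn / (tn + fp)
    f_pos = 2 * prec_pos * rec_pos / (prec_pos + rec_pos)
    f_neg = 2 * prec_neg * rec_neg / (prec_neg + rec_neg)
    return accuracy, prec_pos, prec_neg, rec_pos, rec_neg, f_pos, f_neg

# HMM with stemming: 87/91 correct, 10/7 incorrect
# -> [91.28, 0.93, 0.9, 0.9, 0.93, 0.91, 0.91], matching Tables 5.3 and 5.4.
print([round(m, 2) for m in metrics(87, 91, 10, 7)])
```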
When evaluating the test set for HMM, out of 100 important emails 91 instances were classified correctly, 7 instances were classified incorrectly, and 2 instances could not be determined, which gives us the best accuracy of 91.28%.

After running both the Naive Bayes and HMM algorithms in the 8 combinations of the 3 different processes along with the basic approach, we find a different classification accuracy for each variant.

Process | Accuracy, Naive Bayes Algorithm (%) | Accuracy, HMM Algorithm (%)
Basic | 78.28 | 85.19
Stop Words | 78.65 | 84.04
Stemming | 74.46 | 91.28
Lemmatizing | 74.29 | 83.33
Stop Words + Stemming | 75.00 | 85.05
Stop Words + Lemmatizing | 79.19 | 81.05
Lemmatizing + Stemming | 74.46 | 91.15
Stop Words + Lemmatizing + Stemming | 74.86 | 84.46

Table 5.5: Accuracy (Mail Classification) of Naive Bayes and HMM in Different Processes

Table 5.5 and Figures 5.3 and 5.4 show that stemming with HMM gives the highest accuracy.

Fig 5.3: Accuracy Comparison between Naive Bayes and HMM Algorithm

Fig 5.4: Accuracy Comparison between Naive Bayes and HMM Algorithm
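The comparison shown in Figures 5.3 and 5.4 can be reproduced from Table 5.5. A small plotting sketch, assuming matplotlib as the plotting library (the paper does not specify one):

```python
import matplotlib.pyplot as plt

processes = ["Basic", "Stop Words", "Stemming", "Lemmatizing",
             "SW+Stem", "SW+Lem", "Lem+Stem", "SW+Lem+Stem"]
naive_bayes = [78.28, 78.65, 74.46, 74.29, 75.00, 79.19, 74.46, 74.86]
hmm = [85.19, 84.04, 91.28, 83.33, 85.05, 81.05, 91.15, 84.46]

# Grouped bars: one pair per preprocessing variant, as in Figures 5.3/5.4.
x = range(len(processes))
plt.bar([i - 0.2 for i in x], naive_bayes, width=0.4, label="Naive Bayes")
plt.bar([i + 0.2 for i in x], hmm, width=0.4, label="HMM")
plt.xticks(list(x), processes, rotation=45, ha="right")
plt.ylabel("Accuracy (%)")
plt.legend()
plt.tight_layout()
plt.show()
```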
CONCLUSION

In summary, we propose a comparative approach to email classification using the Naive Bayes Classifier and HMM. We categorize emails by considering only the text part of the message body, because we consider relative words and sentences as features. After running the same variants on both algorithms, we compared the results and used HMM for classification because it gave better accuracy. The structure of our research has been built in such a way that, with a proper dataset and minor ...

REFERENCES

[1] Aery, M., & Chakravarthy, S. (2005). eMailSift: eMail classification based on structure and content. Fifth IEEE International Conference on Data Mining (ICDM'05). IEEE. doi:10.1109/ICDM.2005.58
[2] Islam, R., & Zhou, W. (2007). Email Categorization Using Multi-stage Classification Technique. Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT '07). IEEE. doi:10.1109/PDCAT.2007.71
[3] Klimt, B., & Yang, Y. (2004). The Enron Corpus: A New Dataset for Email Classification Research. In Boulicaut, J.-F., Esposito, F., Giannotti, F., & Pedreschi, D. (Eds.), Machine Learning: ECML 2004. Lecture Notes in Computer Science, 3201. Springer, Berlin, Heidelberg.
[4] Bhat, V. H., Malkani, V. R., Shenoy, P. D., Venugopal, K. R., & Patnaik, L. M. (2011). Classification of email using BeaKS: Behavior and keyword stemming. TENCON 2011 - 2011 IEEE Region 10 Conference, Bali. IEEE. 1139-1143. doi:10.1109/TENCON.2011.6129290
[5] Wang, X., & Cloete, I. (2005). Learning to classify email: a survey. Proceedings of the International Conference on Machine Learning and Cybernetics, 2005, 9, 18-21.
[6] Yitagesu, M. E., & Tijare, M. (2016). Email Classification using Classification Method. International Journal of Engineering Trends and Technology (IJETT), 32(3), 142.
[7] Teli, S. P., & Biradar, S. (2014). Effective Email Classification for Spam and Non-Spam. International Journal of Advanced Research in Computer Science and Software Engineering, 4(6).
[8] Naive-Bayes Classification Algorithm [Online]. Accessed September 2016. Retrieved from http://software.ucv.ro/~cmihaescu/ro/teaching/AIR/docs/Lab4-NaiveBayes.pdf
[9] Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media.
[10] Enron Email Dataset [Online]. Accessed June 2016. Retrieved from https://www.cs.cmu.edu/~./enron/
[11] Jackson, P., & Moulinier, I. (2002). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam/Philadelphia: John Benjamins Publishing Company.
[12] Saraiya, S. U., & Desai, N. (2015). Content Based Categorization of E-Mail using Hidden Markov Model Approach. IEEE International Conference on Advances in Engineering and Technology (ICAET 2014), Tamil Nadu, Chennai. IEEE. doi:10.13140/RG.2.1.3070.9602
[13] Ghahramani, Z. (2001). An Introduction to Hidden Markov Models and Bayesian Networks. International Journal of Pattern Recognition and Artificial Intelligence, 15(1), 9-42.
[14] Jivani, A. G. (2011). A Comparative Study of Stemming Algorithms. International Journal of Computer Technology and Applications, 2(6), 1930-1938.