A Comparative Approach To Email Classification Using Naive Bayes Classifier and Hidden Markov Model

Abstract— This research investigates a comparison between two different approaches to classifying emails by category. Naive Bayes and the Hidden Markov Model (HMM), two different machine learning algorithms, have both been used to detect whether an email is important or spam. The Naive Bayes Classifier is based on conditional probabilities; it is fast, works well with small datasets, and treats independent words as features. The HMM is a generative, probabilistic model that provides a distribution over sequences of observations. HMMs can handle inputs of variable length and help programs reach the most likely decision based on both previous decisions and current data. Various combinations of NLP techniques (stopword removal, stemming, lemmatizing) have been tried with both algorithms to inspect the differences in accuracy and to find the best method among them.

Index Terms— Email Classification, Hidden Markov Model, Naive Bayes, Natural Language Processing, NLTK, Supervised Learning.

I. INTRODUCTION

Email is one of the most important means of communication in today's world, and its usage has increased substantially around the globe. In 2015, the number of emails sent and received per day totaled over 205 billion. This figure was expected to grow at an average annual rate of 3% over the following four years, reaching over 246 billion by the end of 2019¹. As of December 2016, spam messages accounted for 61.66 percent of email traffic worldwide². Filtering these spam emails has therefore become a pressing need for email users around the globe. This paper describes methodologies that can be used to classify emails into different categories such as important and spam. Relative words and sentences have been considered as features for classifying email messages. The difference in nature between the Naive Bayes Classifier and the Hidden Markov Model makes them interesting to compare. A dataset was collected and pre-processed before evaluating accuracy, precision, recall, and F-measure for both algorithms. Stemming, lemmatizing, and stopword removal were applied in various combinations with the algorithms to analyze which algorithm, under which combination, gives the best result.

¹ http://www.radicati.com/wp/wp-content/uploads/2015/02/Email-Statistics-Report-2015-2019-Executive-Summary.pdf
² https://www.statista.com/statistics/420391/spam-email-traffic-share/

II. RELATED WORK

This paper focuses on distinguishing important emails from spam emails. One major factor in the categorization is how to represent the messages: specifically, one must decide which features to use and how to apply those features to the categorization. M. Aery et al. [1] gave an approach based on the premise that patterns can be extracted from a pre-classified email folder and used effectively to classify incoming emails. As emails follow a format of headers and a body, the correlation between different terms can be shown in the form of a graph, so they chose graph mining as a viable technique for pattern extraction and classification. R. Islam et al. [2] proposed a multi-stage classification technique using popular learning algorithms such as SVM, Naive Bayes, and boosting, together with an analyzer that substantially reduces false precision and increases classification accuracy compared to similar existing techniques. B. Klimt et al. [3] introduced the Enron corpus as a new dataset for this domain. V. Bhat et al. [4] derived a spam filter called BeaKS, which classifies emails into spam and non-spam; their pre-processing technique is designed to identify tag-of-spam words relevant to the dataset. X. Wang et al. [5] reviewed recent approaches to filtering out spam email, categorizing email into a hierarchy of folders, and automatically determining the tasks required in response to an email. According to E. Yitagesu et al. [6], in sender-based detection the email sender's information, such as writing style and user name, is used as the major feature. S. Teli et al. [7] presented a three-phase system for spam detection. In the first phase, the user creates the rules for classification; rules are simply the keywords/phrases that occur in legitimate or spam mails. The second phase is the training phase, in which the classifier is trained manually by the user on spam and legitimate emails, and keywords are then extracted from the classified emails by the algorithm. Once the first and second phases are complete, classification begins: using this knowledge of tokens, the filter classifies every new incoming email. Here the probability of maximum ...
... could be labeled as both important and spam in the early stage, this ratio helps to determine the ultimate label of any such word: only if the ratio is 1 is a word considered to be both important and spam. Thus word:label pairs were created, where the most informative features (the words that appeared most frequently in the email dataset) were labeled as important or spam. The spam words were extracted and stored in a list called spamwords; similarly, the important words were stored in a list of their own.

S = {S1, S2, S3, ..., St} is a sequence of states; we do not observe S. Y = {Y1, Y2, Y3, ..., Yt} is a sequence of emissions; we observe Y. Y2 is conditionally independent of everything else given S2, and S4 is conditionally independent of everything else given S3. The probability of being in a particular state at step i is known once we know the state at step i-1, and the probability of seeing a particular emission at step i is known once we know the state at step i. The joint distribution of a sequence of states and observations can therefore be factored in the following way [14]:

P(S, Y) = P(S1) P(Y1|S1) ∏_{i=2}^{t} P(Si|Si-1) P(Yi|Si)
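As a concrete illustration of this factorization (not code from the paper), the sketch below evaluates P(S, Y) for a toy two-state model; the start, transition, and emission tables are invented for the example, not values from our trained models:

```python
# Illustrative HMM joint probability:
# P(S, Y) = P(S1) P(Y1|S1) * prod over i of P(Si|Si-1) P(Yi|Si).
# The probability tables below are made-up examples, not trained values.

start = {"important": 0.6, "spam": 0.4}                    # P(S1)
trans = {"important": {"important": 0.7, "spam": 0.3},     # P(Si | Si-1)
         "spam":      {"important": 0.2, "spam": 0.8}}
emit = {"important": {"meeting": 0.5, "offer": 0.1, "free": 0.4},  # P(Yi | Si)
        "spam":      {"meeting": 0.1, "offer": 0.4, "free": 0.5}}

def joint_probability(states, emissions):
    """P(S, Y) factored as P(S1)P(Y1|S1) times the product of P(Si|Si-1)P(Yi|Si)."""
    p = start[states[0]] * emit[states[0]][emissions[0]]
    for i in range(1, len(states)):
        p *= trans[states[i - 1]][states[i]] * emit[states[i]][emissions[i]]
    return p

print(joint_probability(["spam", "spam"], ["free", "offer"]))  # 0.4*0.5*0.8*0.4 = 0.064
```

For classification, the same quantity can be evaluated under a model trained on important mail and one trained on spam mail, with the higher likelihood deciding the label.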
... determine the email category. Otherwise, the email was classified as a spam email. The HMM algorithm was also run with the same eight combinations of NLP techniques as the Naive Bayes algorithm. This time a different result was obtained: HMM with stemming outperformed all seven other variants. After that, the best result of the Naive Bayes algorithm was compared with the best result of HMM. Overall, HMM was more successful at identifying spam and important emails in the test dataset.
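The eight variants are the 2³ on/off combinations of the three preprocessing steps. A minimal sketch of such a pipeline with NLTK follows; the specific tokenizer, stemmer, and lemmatizer choices here are our assumptions, since the paper does not name them:

```python
from itertools import product

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Assumes the NLTK data packages (punkt, stopwords, wordnet) are downloaded.
STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text, remove_stopwords=False, stem=False, lemmatize=False):
    """Tokenize an email body and apply any combination of the three steps."""
    tokens = [w.lower() for w in word_tokenize(text) if w.isalpha()]
    if remove_stopwords:
        tokens = [w for w in tokens if w not in STOPWORDS]
    if lemmatize:
        tokens = [lemmatizer.lemmatize(w) for w in tokens]
    if stem:
        tokens = [stemmer.stem(w) for w in tokens]
    return tokens

# The 2^3 = 8 combinations ("basic" = all flags off) used in the experiments.
for flags in product([False, True], repeat=3):
    variant = dict(zip(["remove_stopwords", "stem", "lemmatize"], flags))
    print(variant, preprocess("Claim your free offers now!", **variant))
```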
Process | Trained Data | Test Data | Correctly Classified (Important / Spam) | Incorrectly Classified (Important / Spam) | Indeterminate (Important / Spam) | Accuracy (%)
Stop Words | 5500 | 200 | 83 / 57 | 13 / 25 | 4 / 18 | 78.65
Stemming | 5500 | 200 | 85 / 52 | 7 / 40 | 8 / 8 | 74.46
Lemmatizing | 5500 | 200 | 72 / 58 | 16 / 29 | 12 / 13 | 74.29
Stop Words + Stemming | 5500 | 200 | 74 / 58 | 14 / 30 | 12 / 12 | 75.00
Stop Words + Lemmatizing | 5500 | 200 | 70 / 67 | 16 / 20 | 14 / 13 | 79.19
Lemmatizing + Stemming | 5500 | 200 | 85 / 52 | 6 / 41 | 9 / 7 | 74.46
Stop Words + Lemmatizing + Stemming | 5500 | 200 | 72 / 59 | 15 / 29 | 13 / 12 | 74.86

Table 5.1: Evaluation on Test Set (Naive Bayes in Different Processes)

Then again, using only stop words removal, out of 100 important emails 83 instances were classified correctly, 13 instances were classified incorrectly, and 4 instances could not be determined.

Fig 5.1: Accuracy Comparisons (Naive Bayes)

Figure 5.1 shows the accuracy comparison among the 8 different combinational approaches to Naive Bayes; using stop words removal and lemmatizing together gives the best accuracy, 79.19%.

Process | Accuracy | Precision (Positive / Negative) | Recall (Positive / Negative) | F-Measure (Positive / Negative)
Basic | 78.28 | 0.82 / 0.75 | 0.75 / 0.82 | 0.78 / 0.78
Stop Words | 78.65 | 0.77 / 0.81 | 0.86 / 0.70 | 0.81 / 0.75
Stop Words + Lemmatizing + Stemming | 74.86 | 0.71 / 0.80 | 0.83 / 0.67 | 0.77 / 0.73

Table 5.2: Precision, Recall and F-Measure of Test Set (Naive Bayes in Different Processes)
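The word:label training and the most-informative-features extraction described earlier follow the standard NLTK pattern. A minimal sketch with toy messages standing in for the Enron data (the feature function and example emails below are ours, for illustration only):

```python
import nltk

# Toy stand-ins for the labeled Enron emails used in the experiments.
train = [("meeting agenda attached please review", "important"),
         ("free offer claim your prize now", "spam"),
         ("project report due friday", "important"),
         ("cheap pills free shipping offer", "spam")]

def features(text):
    # Bag-of-words features: each word is treated as an independent feature.
    return {word: True for word in text.split()}

classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train])

# The most informative features give the word:label pairs described above;
# words most predictive of "spam" can be collected into a spamwords list.
classifier.show_most_informative_features(5)
print(classifier.classify(features("claim your free offer")))  # spam
```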
Stop words removal gives the highest positive F-measure, 0.81, and lemmatizing gives the lowest positive F-measure, 0.76. We also see that lemmatizing and stemming together give the highest negative precision, 0.90, while basic Naive Bayes gives the lowest negative precision, 0.75. Basic Naive Bayes gives the highest negative recall, 0.82, and lemmatizing and stemming together give the lowest negative recall, 0.56. Lastly, stop words removal and lemmatizing together give the highest negative F-measure, 0.79, and both stemming alone and lemmatizing and stemming together give the lowest negative F-measure, 0.69.

Process | Trained Data | Test Data | Correctly Classified (Important / Spam) | Incorrectly Classified (Important / Spam) | Indeterminate (Important / Spam) | Accuracy (%)
Basic | 5500 | 200 | 80 / 81 | 17 / 11 | 3 / 8 | 85.19
Stop Words | 5500 | 200 | 87 / 71 | 11 / 19 | 2 / 10 | 84.04
Stemming | 5500 | 200 | 87 / 91 | 10 / 7 | 3 / 2 | 91.28
Lemmatizing | 5500 | 200 | 83 / 77 | 15 / 17 | 2 / 6 | 83.33
Stop Words + Stemming | 5500 | 200 | 91 / 74 | 7 / 22 | 2 / 4 | 85.05
Stop Words + Lemmatizing | 5500 | 200 | 86 / 68 | 13 / 23 | 1 / 9 | 81.05
Lemmatizing + Stemming | 5500 | 200 | 86 / 89 | 10 / 7 | 4 / 4 | 91.15
Stop Words + Lemmatizing + Stemming | 5500 | 200 | 91 / 72 | 7 / 23 | 2 / 5 | 84.46

Table 5.3: Evaluation on Test Set (Hidden Markov Model in Different Processes)

Figure 5.2 shows the accuracy comparison among the 8 different combinational approaches to the Hidden Markov Model; using only stemming gives the best accuracy, 91.28%.

Process | Accuracy | Precision (Positive / Negative) | Recall (Positive / Negative) | F-Measure (Positive / Negative)
Basic | 85.19 | 0.88 / 0.83 | 0.82 / 0.88 | 0.85 / 0.85
Stop Words | 84.04 | 0.82 / 0.87 | 0.89 / 0.79 | 0.85 / 0.83
Stemming | 91.28 | 0.93 / 0.90 | 0.90 / 0.93 | 0.91 / 0.91
Lemmatizing | 83.33 | 0.83 / 0.84 | 0.85 / 0.82 | 0.84 / 0.83
Stop Words + Stemming | 85.05 | 0.81 / 0.91 | 0.93 / 0.77 | 0.86 / 0.84
Stop Words + Lemmatizing | 81.05 | 0.79 / 0.84 | 0.87 / 0.75 | 0.83 / 0.79
Lemmatizing + Stemming | 91.15 | 0.92 / 0.90 | 0.90 / 0.93 | 0.91 / 0.91
Stop Words + Lemmatizing + Stemming | 84.46 | 0.80 / 0.91 | 0.93 / 0.76 | 0.86 / 0.83

Table 5.4: Precision, Recall and F-Measure of Test Set (HMM in Different Processes)
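The Table 5.4 values follow from the raw counts in Table 5.3, treating important as the positive class and excluding indeterminate instances from the denominators (our reading of how the reported accuracies were computed, which the numbers bear out). A short sketch of the computation:

```python
def metrics(imp_correct, spam_correct, imp_wrong, spam_wrong):
    """Compute the Table 5.3/5.4 metrics from raw counts.

    Positive class = important. Indeterminate instances are excluded,
    which matches the reported accuracies.
    """
    tp, tn, fn, fp = imp_correct, spam_correct, imp_wrong, spam_wrong
    accuracy = 100 * (tp + tn) / (tp + tn + fn + fp)
    prec_pos, rec_pos = tp / (tp + fp), tp / (tp + fn)
    prec_neg, rec_neg = tn / (tn + fn), tn / (tn + fp)
    f_pos = 2 * prec_pos * rec_pos / (prec_pos + rec_pos)
    f_neg = 2 * prec_neg * rec_neg / (prec_neg + rec_neg)
    return accuracy, prec_pos, prec_neg, rec_pos, rec_neg, f_pos, f_neg

# HMM with stemming: 87/91 correct, 10/7 incorrect
# -> [91.28, 0.93, 0.9, 0.9, 0.93, 0.91, 0.91], matching Tables 5.3 and 5.4.
print([round(m, 2) for m in metrics(87, 91, 10, 7)])
```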
When evaluating the test set for HMM, out of 100 important emails 91 instances were classified correctly, 7 instances were classified incorrectly, and 2 instances could not be determined, which gives us the best accuracy of 91.28%.

After running both the Naive Bayes and HMM algorithms in the 8 combinations of the 3 different processes along with the basic approach, we find a different classification accuracy for each variant.

Process | Accuracy, Naive Bayes Algorithm (%) | Accuracy, HMM Algorithm (%)
Basic | 78.28 | 85.19
Stop Words | 78.65 | 84.04
Stemming | 74.46 | 91.28
Lemmatizing | 74.29 | 83.33
Stop Words + Stemming | 75.00 | 85.05
Stop Words + Lemmatizing | 79.19 | 81.05
Lemmatizing + Stemming | 74.46 | 91.15
Stop Words + Lemmatizing + Stemming | 74.86 | 84.46

Table 5.5: Accuracy (Mail Classification) of Naive Bayes and HMM in Different Processes

Table 5.5 and Figures 5.3 and 5.4 show that stemming with HMM gives the highest accuracy.

Fig 5.3: Accuracy Comparison between Naive Bayes and HMM Algorithm

Fig 5.4: Accuracy Comparison between Naive Bayes and HMM Algorithm
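The comparison shown in Figures 5.3 and 5.4 can be reproduced from Table 5.5. A small plotting sketch, assuming matplotlib as the plotting library (the paper does not specify one):

```python
import matplotlib.pyplot as plt

processes = ["Basic", "Stop Words", "Stemming", "Lemmatizing",
             "SW+Stem", "SW+Lem", "Lem+Stem", "SW+Lem+Stem"]
naive_bayes = [78.28, 78.65, 74.46, 74.29, 75.00, 79.19, 74.46, 74.86]
hmm = [85.19, 84.04, 91.28, 83.33, 85.05, 81.05, 91.15, 84.46]

# Grouped bars: one pair per preprocessing variant, as in Figures 5.3/5.4.
x = range(len(processes))
plt.bar([i - 0.2 for i in x], naive_bayes, width=0.4, label="Naive Bayes")
plt.bar([i + 0.2 for i in x], hmm, width=0.4, label="HMM")
plt.xticks(list(x), processes, rotation=45, ha="right")
plt.ylabel("Accuracy (%)")
plt.legend()
plt.tight_layout()
plt.show()
```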
CONCLUSION

In summary, we propose a comparative approach to email classification using the Naive Bayes Classifier and HMM. We categorize emails by considering only the text part of the message body, because we consider relative words and sentences as features. After running the same variants on both algorithms, we compared the results and used HMM for classification because it gave better accuracy. The structure of our research has been built in such a way that, with a proper dataset and minor ...

REFERENCES

[1] Aery, M., & Chakravarthy, S. (2005). eMailSift: eMail classification based on structure and content. Fifth IEEE International Conference on Data Mining (ICDM'05). IEEE. doi:10.1109/ICDM.2005.58
[2] Islam, R., & Zhou, W. (2007). Email Categorization Using Multi-stage Classification Technique. Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT '07). IEEE. doi:10.1109/PDCAT.2007.71
[3] Klimt, B., & Yang, Y. (2004). The Enron Corpus: A New Dataset for Email Classification Research. In Boulicaut, J.-F., Esposito, F., Giannotti, F., & Pedreschi, D. (Eds.), Machine Learning: ECML 2004. Lecture Notes in Computer Science, 3201. Springer, Berlin, Heidelberg.
[4] Bhat, V. H., Malkani, V. R., Shenoy, P. D., Venugopal, K. R., & Patnaik, L. M. (2011). Classification of email using BeaKS: Behavior and keyword stemming. TENCON 2011 - 2011 IEEE Region 10 Conference, Bali. IEEE. 1139-1143. doi:10.1109/TENCON.2011.6129290
[5] Wang, X., & Cloete, I. (2005). Learning to classify email: a survey. Proceedings of the International Conference on Machine Learning and Cybernetics, 2005, 9, 18-21.
[6] Yitagesu, M. E., & Tijare, M. (2016). Email Classification using Classification Method. International Journal of Engineering Trends and Technology (IJETT), 32(3), 142.
[7] Teli, S. P., & Biradar, S. (2014). Effective Email Classification for Spam and Non-Spam. International Journal of Advanced Research in Computer Science and Software Engineering, 4(6).
[8] Naive-Bayes Classification Algorithm [Online]. Accessed September 2016. Retrieved from http://software.ucv.ro/~cmihaescu/ro/teaching/AIR/docs/Lab4-NaiveBayes.pdf
[9] Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media.
[10] Enron Email Dataset [Online]. Accessed June 2016. Retrieved from https://www.cs.cmu.edu/~./enron/
[11] Jackson, P., & Moulinier, I. (2002). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam/Philadelphia: John Benjamins Publishing Company.
[12] Saraiya, S. U., & Desai, N. (2015). Content Based Categorization of E-Mail using Hidden Markov Model Approach. IEEE International Conference on Advances in Engineering and Technology (ICAET 2014), Tamil Nadu, Chennai. IEEE. doi:10.13140/RG.2.1.3070.9602
[13] Ghahramani, Z. (2001). An Introduction to Hidden Markov Models and Bayesian Networks. International Journal of Pattern Recognition and Artificial Intelligence, 15(1), 9-42.
[14] Jivani, A. G. (2011). A Comparative Study of Stemming Algorithms. International Journal of Computer Technology and Applications, 2(6), 1930-1938.