
Proceedings of the Second International Conference on Inventive Research in Computing Applications (ICIRCA-2020)

IEEE Xplore Part Number: CFP20N67-ART; ISBN: 978-1-7281-5374-2

Email Spam Detection Using Machine Learning Algorithms

Abstract— Email spam has become a major problem nowadays; with the rapid growth of internet users, email spam is also increasing. Spammers use spam for illegal and unethical conduct such as phishing and fraud, sending malicious links that can harm a system or steal information from it. Creating a fake profile and email account is easy for spammers: they pose as genuine people in their spam emails and target users who are not aware of these frauds. It is therefore necessary to identify fraudulent spam emails. This project identifies spam using machine learning techniques; this paper discusses several machine learning algorithms, applies them to our data sets, and selects the algorithm with the best precision and accuracy for email spam detection.

Keywords: Machine learning, Naïve Bayes, support vector machine, K-nearest neighbor, random forest, bagging, boosting, neural networks.

I. INTRODUCTION

Email or electronic mail spam refers to "using email to send unsolicited emails or advertising emails to a group of recipients. Unsolicited emails mean the recipient has not granted permission for receiving those emails." The popularity of spam emails has been increasing over the last decade. Spam has become a big misfortune on the internet: it wastes storage, time, and message bandwidth. Automatic email filtering may be the most effective method of detecting spam, but nowadays spammers can easily bypass such filtering applications. Several years ago, most spam could be blocked manually by rejecting mail from certain email addresses. In this work, a machine learning approach is used for spam detection.

Major approaches adopted for spam filtering include "text analysis, white and blacklists of domain names, and community-based techniques". Text analysis of mail content is a widely used method against spam, and many solutions deployable on the server and client side are available; Naive Bayes is one of the most well-known algorithms applied in these procedures. However, rejecting mails based purely on content analysis is a difficult problem because of false positives: users and organizations usually cannot afford to lose legitimate messages. The blacklist approach was probably the earliest technique used for filtering spam; it accepts all mails except those from domains or email addresses explicitly blacklisted. As newer domains enter the category of spamming domain names, this technique tends to no longer work so well. The whitelist approach accepts mails from domain names or addresses that are explicitly whitelisted and places the others in a lower-priority queue, which is delivered only after the sender responds to a confirmation request sent by the "junk mail filtering system".

Spam and Ham: According to Wikipedia, "the use of electronic messaging systems to send unsolicited bulk messages, especially mass advertisement, malicious links etc." is called spam. "Unsolicited" means the recipient did not ask for messages from that source, so if you do not know the sender, the mail may be spam. People generally do not realize that they signed up for such mailers when downloading free services or software, or while updating software. The term "ham" was coined by SpamBayes around 2001 and is defined as "emails that are generally desired and are not considered spam".

Fig.1. Classification into Spam and non-spam
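To make the blacklist and whitelist approaches described above concrete, the following is a minimal illustrative sketch, not taken from the paper: the domain lists and function name are invented, and a real filter would consult maintained lists rather than hard-coded sets.

```python
# Hypothetical example: routing mail by sender domain before any content
# analysis, combining the blacklist and whitelist approaches.
BLACKLIST = {"spamdomain.example", "bulk-ads.example"}   # assumed domains
WHITELIST = {"university.example", "employer.example"}   # assumed domains

def route_by_domain(sender: str) -> str:
    """Return 'reject', 'inbox', or 'low-priority' for a sender address."""
    domain = sender.rsplit("@", 1)[-1].lower()
    if domain in BLACKLIST:
        return "reject"        # blacklist: refuse known spamming domains
    if domain in WHITELIST:
        return "inbox"         # whitelist: deliver trusted domains directly
    return "low-priority"      # unknown senders await a confirmation request

print(route_by_domain("alice@university.example"))  # inbox
print(route_by_domain("promo@bulk-ads.example"))    # reject
```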


Machine learning approaches are more efficient: a set of training data is used, where the samples are emails that have been pre-classified. Machine learning offers many algorithms that can be used for email filtering, including "Naïve Bayes, support vector machines, neural networks, K-nearest neighbor, random forests etc."

II. LITERATURE REVIEW

There is related work that applies machine learning methods to email spam detection. A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti and M. Alazab [2] describe a focused literature survey of Artificial Intelligence (AI) and machine learning methods for email spam detection. K. Agarwal and T. Kumar [3], Harisinghaney et al. (2014) [4], and Mohamad & Selamat (2015) [5] have used image and textual datasets for email spam detection with various methods. Harisinghaney et al. (2014) [4] used the KNN algorithm, Naïve Bayes, and the Reverse DBSCAN algorithm in their experiments; for text recognition an OCR library [3] is employed, but the OCR does not perform well. Mohamad & Selamat (2015) [5] use a hybrid feature-selection approach combining TF-IDF (Term Frequency Inverse Document Frequency) and rough set theory.

A. Data Set

This model has used email data sets from different online sources like Kaggle and sklearn, and some data sets were created by us. A spam email data set from Kaggle is used to train our model, and other email data sets are used for evaluation. The "spam.csv" data set contains 5573 rows and 2 columns, and the other data sets contain 574, 1001, and 956 lines of email data in text format.

III. METHODOLOGY

A. Data preprocessing:

Real data sets are usually very large, with many rows and columns, but the data is not always tabular: it can come in many forms such as images, audio and video files, structured tables, etc. A machine does not understand images, video, or text as they are; it only understands 1s and 0s.

Steps in data preprocessing:
Data cleaning: filling in "missing values", "smoothing of noisy data", "identifying or removing outliers", and "resolving of inconsistencies".
Data integration: combining several databases, data files, or data sets.
Data transformation: aggregation and normalization are performed to scale values to a specific range.
Data reduction: obtaining a reduced representation of the dataset that is much smaller in size but produces nearly the same analytical results.

1. Stop words:
"Stop words are the English words that do not add much meaning to a sentence." They can be safely ignored without sacrificing the sense of the sentence. For example, for a search query like "How to make a veg cheese sandwich", the search engine will look for web pages containing the terms "how", "to", "make", "a", "veg", "cheese", "sandwich". It may rank pages containing "how", "to", "a" above pages containing the recipe for a veg cheese sandwich, because those three words are so commonly used in the English language. If these words are removed (stopped) and the search focuses on retrieving pages containing the keywords "veg", "cheese", "sandwich", it will return the results of interest.

2. Tokenization:
"Tokenization is the process of splitting a stream of text into phrases, symbols, words, or other meaningful elements named tokens." The list of tokens is then used as input for further processing such as text mining and parsing. Tokenization is useful both in linguistics (as a form of text segmentation) and in computer science, where it forms part of lexical analysis. As tokenization happens at the word level, it is occasionally hard to define what is meant by the term "word". Frequently a tokenizer relies on simple heuristics, for instance:
Tokens are separated by whitespace characters, like "line break" or "space", or by "punctuation characters".
Every contiguous string of alphabetic characters is part of one token; likewise with numbers.
Whitespace and punctuation may or may not be included in the resulting list of tokens.

3. Bag of words:
"Bag of Words (BOW) is a method of extracting features from text documents. These features can then be used for training machine learning algorithms. Bag of Words creates a vocabulary of all the unique words present in all the documents in the training dataset."
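As a concrete illustration of these three preprocessing steps, the sketch below removes stop words, tokenizes, and builds a bag-of-words matrix with scikit-learn's CountVectorizer. The sample emails are invented for demonstration; the paper does not prescribe this exact code.

```python
# Illustrative sketch: stop-word removal, tokenization, and bag-of-words
# using scikit-learn. The example emails are made up for demonstration.
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "How to make a veg cheese sandwich",
    "WIN a FREE prize now, click the link",
]

# CountVectorizer tokenizes on word boundaries, drops English stop words
# ("how", "to", "a", ...), and builds a vocabulary of the unique words.
vectorizer = CountVectorizer(stop_words="english", lowercase=True)
bow = vectorizer.fit_transform(emails)   # document-term count matrix

print(vectorizer.get_feature_names_out())
# e.g. ['cheese' 'click' 'free' 'link' 'make' 'prize' 'sandwich' 'veg' 'win']
print(bow.toarray())                     # one row of word counts per email
```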


B. CLASSIC CLASSIFIERS

Classification is a form of data analysis that extracts models describing important data classes. A classifier or model is constructed to predict class labels, for example: "a loan application as risky or safe." Data classification is a two-step process:
- a learning step (construction of the classification model), and
- a classification step.

1. NAÏVE BAYES:

The Naïve Bayes classifier was first used for spam recognition in 1998. It is a supervised learning algorithm. The Bayesian classifier works on the probability of a future event, which can be estimated from occurrences of the same event in the past. Naïve Bayes is built on Bayes' theorem, with the assumption that features are independent of each other. The Naïve Bayes technique can be used for classifying spam emails, as word probability plays the main role here: if a word occurs often in spam but not in ham, an email containing it is likely spam. The Naive Bayes classifier has become a standard technique for email filtering; for this, the model must be trained well for the filter to work effectively. Naive Bayes calculates the probability of each class, and the class with the maximum probability is chosen as the output. Naïve Bayes generally provides accurate results and is used in many fields, spam filtering among them.

P(c|x) = P(x|c) P(c) / P(x)   (1)

P(c|x1, ..., xn) = P(x1|c) × P(x2|c) × ... × P(xn|c) × P(c)   (2)

2. SUPPORT VECTOR MACHINE

"The Support Vector Machine (SVM) is a popular supervised learning algorithm; the support vector model is used for classification problems in machine learning." Support Vector Machines are founded on the idea of decision boundaries: the main purpose of the SVM algorithm is to create a line or decision boundary. The SVM algorithm outputs a hyperplane which classifies new samples. In 2-dimensional space, the "hyperplane is a line dividing the plane into 2 parts, where each class lies on one side."

Fig.2. Support Vector Machine

3. DECISION TREE

"Decision tree induction is the learning of decision trees from class-labeled training tuples." A decision tree is a flow-chart-like structure, where:
Internal node (non-leaf node) = test on an attribute
Branch = outcome of the test
Leaf node = holds a class label
The top node is called the root node.

Fig.3. Decision Tree Structure

Decision tree induction: building "decision tree classifiers" does not need "any domain knowledge or parameter setting", which makes it suitable for exploratory knowledge discovery. It handles multidimensional data, and the learning and classification phases of decision tree induction are simple and fast. Attribute selection measures are used to choose the attribute that best partitions the tuples into distinct classes. When a decision tree is built, many of the branches may reflect noise and outliers in the training data; tree pruning attempts to identify and remove such branches, with the goal of improving classifier accuracy on unseen data.

Entropy using the frequency table of one attribute:

E(S) = Σ (i = 1 to c) −pᵢ log₂ pᵢ   (3)

Entropy using the frequency table of two attributes:

E(T, X) = Σ (c ∈ X) P(c) E(c)   (4)

4. K-NEAREST NEIGHBOUR

"K-nearest neighbors is a supervised classification algorithm. It uses data points and data vectors that are separated into several classes to predict the classification of a new sample point." K-nearest neighbor is a lazy algorithm: it only memorizes the training data rather than learning a model of its own, and it makes no decision until a new point must be classified. The algorithm classifies a new point based on a similarity measure, which can be the Euclidean distance. The Euclidean distance to each stored point identifies the new point's neighbors:

dist((x, y), (a, b)) = √((x − a)² + (y − b)²)   (5)
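Equations (3)-(5) are simple to compute directly. The following small sketch, not from the paper, shows the entropy and Euclidean distance calculations that underlie the decision tree and KNN classifiers above; the tiny label list is invented for illustration.

```python
# Sketch of equations (3)-(5): entropy for decision trees and Euclidean
# distance for KNN.
import math
from collections import Counter

def entropy(labels):
    """E(S) = sum over classes of -p_i * log2(p_i), as in eq. (3)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def euclidean(p, q):
    """dist(p, q) = sqrt of the sum of squared differences, as in eq. (5)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

labels = ["spam", "ham", "spam", "spam"]      # hypothetical class labels
print(entropy(labels))                         # ≈ 0.811 bits
print(euclidean((1.0, 2.0), (4.0, 6.0)))       # 5.0
```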
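All four classic classifiers are available in scikit-learn, the library the implementation section later names. A minimal sketch of training and comparing them on a bag-of-words matrix might look like the following; the column names and file layout are assumptions, not taken from the paper's code.

```python
# Sketch: training the four classic classifiers on a bag-of-words matrix.
# Assumes a CSV with 'text' and 'label' columns; the real spam.csv layout
# may differ.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("spam.csv").drop_duplicates().dropna()
X = CountVectorizer(stop_words="english").fit_transform(df["text"])
X_tr, X_te, y_tr, y_te = train_test_split(X, df["label"], test_size=0.3)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    print(name, clf.score(X_te, y_te))   # accuracy on the held-out 30%
```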


C. ENSEMBLE LEARNING METHODS

"Ensemble methods in machine learning take several base models and combine them into one predictive model" in order to decrease variance (by bagging), decrease bias (by boosting), or improve predictions (by stacking). There are two types: sequential, where base classifiers are created one after another, and parallel, where base classifiers are created in parallel.

1. RANDOM FOREST CLASSIFIER

The random forest classifier is an ensemble tree classifier consisting of many decision trees of different shapes and sizes. It uses random sampling of the training data when building each tree, and a random subset of the input features when splitting at each node. This randomization makes the decision trees less correlated (the trees should not all look the same), so that the generalization error of the ensemble can be improved.

2. BAGGING

"A bagging classifier is an ensemble classifier that fits base classifiers, each on a random subset of the original data set, and then combines their individual predictions (by voting or by averaging) to form a final prediction." Bagging is a mixture of bootstrapping and aggregating:

Bagging = Bootstrap AGGregatING

Bootstrapping helps to lessen the variance of the classifier and also reduces overfitting, by resampling data from the training set with the same cardinality as the original data set; high variance is not good for a model. Bagging is a very effective method when data is limited, because by using bootstrap samples you are able to get a good estimate by aggregating the individual scores.

3. BOOSTING AND ADABOOST CLASSIFIER

"Boosting is an ensemble method that is used to create a strong classifier from a number of weak classifiers. Boosting is done by building a model from the training data, then creating a second model that corrects the errors of the first." [8] In boosting, models are added until the training set is predicted properly.

AdaBoost = Adaptive Boosting

AdaBoost was the first successful boosting algorithm developed for binary classification, and boosting is most easily understood through AdaBoost.

IV. ALGORITHMS

1.1. Insert the dataset or file for training or testing.
1.2. Check the dataset for supported encoding.
  1.2.1. If it is one of the supported encodings, go to step 1.4.
  1.2.2. If it is not one of the supported encodings, go to step 1.3.
1.3. Change the encoding of the inserted file to one of the supported encodings, then try reading it again.
1.4. Select whether you want to "Train", "Test" or "Compare" the models using the dataset.
  1.4.1. If "Train" is selected, go to step 1.5.
  1.4.2. If "Test" is selected, go to step 1.6.
  1.4.3. If "Compare" is selected, go to step 1.7.
1.5. "Train" selected:
  1.5.1. Select which classifier to train using the inserted dataset.
  1.5.2. Check for duplicates and NaN values.
  1.5.3. Find the values from hyperparameter tuning.
  1.5.4. Process the text for the feature transform.
  1.5.5. Train the model.
  1.5.6. Save the model and features. Show the results.
1.6. "Test" selected:
  1.6.1. Select which classifier to test using the inserted dataset.
  1.6.2. Check for duplicates and NaN values.
  1.6.3. Load the model and features saved in the training phase.
  1.6.4. Use the loaded values for testing the dataset.
  1.6.5. Show the results.
1.7. "Compare" selected:
  1.7.1. Compare all the classifiers using the inserted dataset.
  1.7.2. Show the results of the classifiers.

A. Implementation

The Visual Studio Code platform is used to implement the model, and a dataset from the "Kaggle" website is used as the training dataset. The inserted dataset is first checked for duplicates and null values for better performance. Then the dataset is split into 2 sub-datasets, a "train dataset" and a "test dataset", in the proportion 70:30. The "train" and "test" datasets are then passed as parameters for text-processing, in which punctuation symbols and words on the stop-words list are removed and clean words are returned. These clean words are then passed for "feature transform": the clean words returned from text-processing are used to 'fit' and 'transform' a vocabulary for the machine. The dataset is also passed for "hyperparameter tuning" to find optimal values for the classifier to use on this dataset. After acquiring the values from hyperparameter tuning, the machine is fitted using those values with a fixed random state. The state of the trained model and its features are saved for future use on unseen test data. Using classifiers from the sklearn module in Python, the machines are trained using the values obtained above, as sketched below.
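The ensemble classifiers of Section C have direct scikit-learn counterparts. A minimal sketch follows; the toy feature vectors and labels are fabricated for illustration, standing in for any bag-of-words matrix built as in the earlier preprocessing sketch.

```python
# Sketch: the three ensemble methods of Section C in scikit-learn.
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              AdaBoostClassifier)
from sklearn.tree import DecisionTreeClassifier

X = [[0, 1], [1, 0], [1, 1], [0, 0]]   # toy feature vectors (word counts)
y = ["spam", "ham", "spam", "ham"]     # toy labels

ensembles = {
    # many decorrelated trees, each on a random sample of rows and features
    "Random Forest": RandomForestClassifier(n_estimators=100),
    # bootstrap-aggregated copies of a single base classifier
    "Bagging": BaggingClassifier(DecisionTreeClassifier()),
    # weak learners added sequentially, each correcting the previous one
    "AdaBoost": AdaBoostClassifier(n_estimators=50),
}
for name, model in ensembles.items():
    model.fit(X, y)
    print(name, model.predict([[1, 1]]))
```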
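Putting the implementation steps together (encoding check, de-duplication, 70:30 split, text-processing, feature transform, hyperparameter tuning, saving), one possible end-to-end sketch is shown below. The file names, column names, and parameter grid are assumptions, since the paper does not list its exact code.

```python
# Sketch of the pipeline in Section IV / A. Implementation. File layout,
# column names, and the parameter grid are assumed, not taken from the paper.
import string
import joblib
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB

def read_with_fallback(path):
    """Steps 1.2-1.3: retry with another encoding if the default fails."""
    try:
        return pd.read_csv(path, encoding="utf-8")
    except UnicodeDecodeError:
        return pd.read_csv(path, encoding="latin-1")

def clean(text):
    """Text-processing: strip punctuation; stop words are dropped later."""
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

df = read_with_fallback("spam.csv").drop_duplicates().dropna()  # step 1.5.2
X_tr, X_te, y_tr, y_te = train_test_split(
    df["text"].map(clean), df["label"], test_size=0.3, random_state=42)

vec = CountVectorizer(stop_words="english")          # feature transform
X_tr_bow = vec.fit_transform(X_tr)                   # fit + transform

# Step 1.5.3: hyperparameter tuning (here, Naive Bayes smoothing strength).
search = GridSearchCV(MultinomialNB(), {"alpha": [0.1, 0.5, 1.0]}, cv=5)
search.fit(X_tr_bow, y_tr)                           # step 1.5.5: train

print(search.best_params_, search.score(vec.transform(X_te), y_te))
joblib.dump((search.best_estimator_, vec), "model.joblib")  # step 1.5.6
```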


B. Flowchart of the model

Fig.4. Flow Chart of Model

V. RESULT

Our model has been trained using multiple classifiers to check and compare the results for greater accuracy. Each classifier gives its evaluated results to the user; after all the classifiers return their results, the user can compare them to see whether the data is "spam" or "ham". Each classifier's result is shown in graphs and tables for better understanding. The dataset used for training is obtained from the "Kaggle" website and is named "spam.csv". To test the trained machine, a different CSV file named "emails.csv" is built from unseen data, i.e. data that was not used for training.
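Table I below reports four accuracy scores per classifier; scores 3 and 4 add a stemmer and a message-length feature on top of hyperparameter tuning. As a sketch of how such a variant might be evaluated on the unseen data, the snippet below applies NLTK's PorterStemmer before scoring a saved model; the file and column names are assumptions, and a full "score 3" run would stem the training text as well.

```python
# Sketch: evaluating a tuned, saved model on unseen data with stemming,
# roughly corresponding to "score 3" in Table I. Names are assumptions.
import joblib
import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Reduce each word to its stem, e.g. 'winning' -> 'win'."""
    return " ".join(stemmer.stem(w) for w in text.lower().split())

model, vec = joblib.load("model.joblib")        # saved in the training phase
unseen = pd.read_csv("emails.csv").drop_duplicates().dropna()

X_unseen = vec.transform(unseen["text"].map(stem_text))
print("accuracy on unseen data:", model.score(X_unseen, unseen["label"]))
```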


TABLE I. COMPARISON TABLE

   Classifiers                 Score 1   Score 2   Score 3   Score 4
1  Support Vector Classifier    0.81      0.92      0.95      0.92
2  K-Nearest Neighbour          0.92      0.88      0.87      0.88
3  Naïve Bayes                  0.87      0.98      0.98      0.98
4  Decision Tree                0.94      0.95      0.93      0.95
5  Random Forest                0.90      0.92      0.92      0.92
6  AdaBoost Classifier          0.95      0.94      0.95      0.94
7  Bagging Classifier           0.94      0.94      0.95      0.94

a. Score 1: using default parameters
b. Score 2: using hyperparameter tuning
c. Score 3: using stemmer and hyperparameter tuning
d. Score 4: using length, stemmer and hyperparameter tuning

Fig.5. Comparison of all algorithms

Fig.6. Comparison Graph

VI. CONCLUSION

From these results, it can be concluded that Multinomial Naïve Bayes gives the best outcome, but it is limited by its class-conditional independence assumption, which makes the machine misclassify some tuples. Ensemble methods, on the other hand, proved useful because they use multiple classifiers for class prediction. Nowadays huge volumes of email are sent and received, and our project is only able to test emails using a limited corpus. Our spam detection is thus capable of filtering mails according to the content of the email, and not according to domain names or any other criteria; at present it considers only the limited body of the email.

There is wide scope for improvement in our project. The following improvements can be made:
"Filtering of spams can be done on the basis of trusted and verified domain names."
"The spam email classification is very significant in categorizing e-mails and distinguishing e-mails that are spam or non-spam."
"This method can be used by large organizations to single out the decent mails that are the only emails they wish to receive."

REFERENCES

1. Suryawanshi, Shubhangi, Anurag Goswami, and Pramod Patil. "Email Spam Detection: An Empirical Comparative Study of Different ML and Ensemble Classifiers." 2019, pp. 69-74. doi: 10.1109/IACC48062.2019.8971582.
2. Karim, A., Azam, S., Shanmugam, B., Kannoorpatti, K., and Alazab, M. "A Comprehensive Survey for Intelligent Spam Email Detection." IEEE Access, vol. 7, 2019, pp. 168261-168295. https://doi.org/10.1109/ACCESS.2019.2954791
3. Agarwal, K., and Kumar, T. "Email Spam Detection Using Integrated Approach of Naïve Bayes and Particle Swarm Optimization." 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 2018, pp. 685-690.
4. Harisinghaney, Anirudh, Aman Dixit, Saurabh Gupta, and Anuja Arora. "Text and image-based spam email classification using KNN, Naïve Bayes and Reverse DBSCAN algorithm." 2014 International Conference on Optimization, Reliability, and Information Technology (ICROIT), pp. 153-155. IEEE, 2014.
5. Mohamad, Masurah, and Ali Selamat. "An evaluation on the efficiency of hybrid feature selection in spam email classification." 2015 International Conference on Computer, Communications, and Control Technology (I4CT), pp. 227-231. IEEE, 2015.
6. Shradhanjali, and Prof. Toran Verma. "E-Mail Spam Detection and Classification Using SVM and Feature Extraction." International Journal of Advance Research, Ideas and Innovation in Technology, 2017. ISSN: 2454-132X.
7. Awad, W.A., and ELseuofi, S.M. "Machine Learning Methods for Spam E-Mail Classification." International Journal of Computer Science & Information Technology, vol. 3, 2011. doi: 10.5121/ijcsit.2011.3112.
8. Ameen, A. K., and Kaya, B. "Spam detection in online social networks by deep learning." 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), Malatya, Turkey, 2018, pp. 1-4.
9. Diren, D.D., Boran, S., Selvi, I.H., and Hatipoglu, T. "Root Cause Detection with an Ensemble Machine Learning Approach in the Multivariate Manufacturing Process." 2019.
10. Kabir, Tasnim, Abida Sanjana Shemonti, and Atif Hasan Rahman. "Notice of Violation of IEEE Publication Principles: Species Identification Using Partial DNA Sequence: A Machine Learning Approach." 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE), 2018.
