Machine Learning Paper-2
ABSTRACT

The past few years have seen an increase in the number of spam emails and messages. Legal, economic and technical measures can be used to tackle SMS spam nowadays, and Bayesian filters play a key role in stopping this problem. In this paper, we analyze the relative strengths of various machine learning algorithms for detecting spam messages sent to mobile devices. We acquired the data from an open public dataset and prepared two datasets for testing and validation. Accuracy in detecting spam messages was the first priority in ranking these algorithms. Our results demonstrate that different machine learning algorithms perform differently under different feature sets when classifying spam messages.

Keywords: Natural Language Processing (NLP), spam detection, machine learning.

978-1-5386-0627-8/17/$31.00 ©2017 IEEE

1. INTRODUCTION

Since 2001, mobile spam messages have been a persistent problem in Far East countries. In fact, the amount of SMS spam there came to exceed the number of spam emails. In some countries the SMS spam problem was so disturbing that legal action was taken to stop the practice. In 2002 the Japanese government introduced two acts defining and penalizing email and mobile abuse. These laws brought considerable relief to the public with regard to spam message abuse and eliminated some of the associated problems. Experts also feel that both technical and legal measures are needed to control the widespread abuse of mobile spam messages. Because sending a spam SMS takes more effort and expense than sending a spam email, the practice is still considered a minor problem in Western countries. But in Europe, where a user sends an average of 10 texts a day and most of the population owns a smartphone, spam messages are common.

In fact, in countries like Russia, botnets and zombie PCs emulate real users when sending messages through free messaging services. So we can reasonably assume that the cost of SMS spam has dropped drastically; in other words, mobile spam has a good return on investment (ROI). The topic has been under constant discussion, and concrete technical measures have been proposed to tackle the problem. Most of these measures and practices can be used to deal with SMS spam messages. Bayesian filters have been the most prominent and most widely accepted. These methods use machine learning algorithms to separate legitimate messages from SMS spam. This paper tests the results obtained with Bayesian filtering and compares them with other machine learning algorithms in order to find the best alternative for classifying SMS spam messages. We also propose some feature sets that different machine learning algorithms can use to classify text messages, and we present our findings, with accuracy as the deciding factor in choosing the best method.

2. TECHNICAL MEASURES AGAINST SPAM EMAIL

Activities like phishing, selling drugs and advertising pornographic sites use millions of spam email messages on average [1]. The problem has grown disturbingly in recent years, which has motivated scientists all over the world to take a keen interest in it and come up with good solutions.

2.1 Spam Email Filtering

The following are the best techniques currently available to reduce the amount of spam messages.

White and black listing. Messages coming from people who are considered spammers are blocked. This is made
possible by blacklisting those people. However, this does not disrupt the normal flow of messages, as other senders are still whitelisted and their messages are handled normally by the mobile device.

Address management. This technique relies on a system of automatically generated addresses that machines produce at particular time intervals [3].

Collaborative filtering. This type of algorithm is used in recommendation systems, where one user marking an item as "something" gives other users information based on the first user's tag. It can be applied here as well: if one person marks a message as "spam", other people can be warned when opening that message [4].

Digital signatures. If no digital signature is present in a message, it is safe to assume that it is a spam message. The service provider can be used to obtain the digital signature of the sender [4].

Content-based filtering [6]. Here, a careful approach is taken. A well-annotated dataset of spam messages is maintained and analyzed for patterns in those messages. Machine learning models are then built on top of the annotated, classified messages.

3. EXPERIMENTS AND OBSERVATIONS

We ran tests and analyses with various machine learning algorithms on different feature sets and ranked the algorithms by accuracy, so as to determine which one classifies SMS spam messages with the highest accuracy.

3.1 Corpus

The database used for this research is publicly available on Kaggle under the name "SMS spam collection database". Kaggle is a website that hosts an open repository of public datasets; this dataset was provided to it from the UCI Machine Learning Repository and other open public datasets.

3.2 Database Description

Our database is a freely available dataset of 5,574 classified short messages (SMS) that are real and non-encoded. The messages are in English and are labelled as ham or spam. First, we tokenized the text and obtained the information gain (IG) matrix. We also removed all stop words from the text, since they could interfere with the modelling procedure while carrying no information useful to the model.
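The preprocessing described in Section 3.2 can be illustrated with a minimal sketch. It assumes the standard information gain definition for a binary term-presence feature, IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]; the paper does not spell out how its IG matrix was built, so this is only one plausible reading. The stop-word list, the four toy messages and all function names are hypothetical and are not taken from the dataset.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the list used in the paper is not given.
STOP_WORDS = {"a", "the", "to", "you", "is", "at", "for", "are", "your", "now"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def entropy(counts):
    """Shannon entropy in bits of a label distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels):
    """Information gain of each token's presence with respect to the label."""
    base = entropy(list(Counter(labels).values()))
    token_sets = [set(tokenize(d)) for d in docs]
    vocab = set().union(*token_sets)
    n = len(docs)
    gains = {}
    for term in vocab:
        present = Counter(l for s, l in zip(token_sets, labels) if term in s)
        absent = Counter(l for s, l in zip(token_sets, labels) if term not in s)
        n_present = sum(present.values())
        n_absent = n - n_present
        conditional = 0.0
        if n_present:
            conditional += n_present / n * entropy(list(present.values()))
        if n_absent:
            conditional += n_absent / n * entropy(list(absent.values()))
        gains[term] = base - conditional
    return gains

# Toy corpus, invented for illustration only.
docs = [
    "Win a free prize now",
    "Free entry to win cash",
    "Are you coming home for dinner",
    "See you at the office tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

gains = information_gain(docs, labels)
for term, gain in sorted(gains.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{term}: {gain:.3f}")
```

Tokens that occur in only one class score highest, which is the property a feature selection step based on information gain relies on.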
2017 International Conference on Computing and Communication Technologies for Smart Nation (IC3TSN) 29
3.3 Feature Selection

3.4 Proposed Model
3.5 Experimental Results

We fed the following information to all three algorithms: the raw data, the information gain matrix and the message length. After parameter optimization using 5-fold cross-validation, we obtained the results and accuracies shown for our models (fig. 3). We let each algorithm choose its own features: it could discard both extra features and use only the raw text, use either one of them, or use both. Not all the algorithms picked all the features. Naïve Bayes performed best, by a marginal difference over its variants with the other features, when it used just the information gain matrix, while the other two algorithms used all the features.

4. CONCLUSION

From this set of experiments, we can say that naïve Bayes outperforms the random forest and logistic regression algorithms when classifying SMS spam. The naïve Bayes algorithm achieved a high accuracy of 98.445% with just the information gain matrix and easily classified the text as spam or non-spam.

Although we have not provided a detailed analysis of the running times of these algorithms, we checked that naïve Bayes also has the lowest running time among them. The random forest algorithm using both features also performed excellently and could well be a good alternative. Overall, we have experimentally verified that naïve Bayes outperforms the random forest and logistic regression algorithms.
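The 5-fold cross-validation protocol mentioned in Section 3.5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses a multinomial naïve Bayes classifier with Laplace smoothing (the paper does not say which variant was used), leaves out random forest and logistic regression, and runs on ten invented messages rather than the 5,574-message corpus. All function names are hypothetical.

```python
import math
import random
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def train_nb(docs, labels):
    """Collect class counts and per-class token counts."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(tokenize(doc))
    vocab = {t for counts in token_counts.values() for t in counts}
    return class_counts, token_counts, vocab

def predict_nb(model, doc):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokenize(doc):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_val_accuracy(docs, labels, k=5, seed=0):
    """Mean accuracy over k folds; each message is held out exactly once."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = train_nb([docs[i] for i in train], [labels[i] for i in train])
        hits = sum(predict_nb(model, docs[i]) == labels[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k

# Toy messages, invented for illustration only.
DOCS = [
    "Win a free prize now", "Free entry to win cash",
    "Claim your free cash prize", "Urgent call now to claim prize",
    "Win cash call now", "Are you coming home for dinner",
    "See you at the office tomorrow", "Can you call me after lunch",
    "I will be home late tonight", "Lunch at the office tomorrow",
]
LABELS = ["spam"] * 5 + ["ham"] * 5

accuracy = cross_val_accuracy(DOCS, LABELS, k=5)
print(f"mean 5-fold accuracy: {accuracy:.3f}")
```

Comparing several algorithms in this way would amount to calling the same fold loop with a different train and predict pair, so that every model is scored on identical held-out messages.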
REFERENCES

[1] Christine E. Drake, Jonathan J. Oliver, Eugene J. Koontz. Anatomy of a Phishing Email. Proceedings of the First Conference on Email and Anti-Spam (CEAS), 2004.

[2] Dwork, C., Goldberg A., Naor M. On memory-bound functions for fighting spam. In Proceedings of the 23rd Annual International Cryptology Conference (CRYPTO 2003), August 2003.

[3] R.J. Hall. How to avoid unwanted email. Communications of the ACM, March 1998.

[5] Tompkins T., Handley D. Giving e-mail back to the users: Using digital signatures to solve the spam problem. First Monday, 8(9), September 2003.

[6] Yang, Y. 1999. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1/2):69-90.

[7] Yang, Y., J.O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning.