
SMS spam detection and comparison of various machine learning algorithms

Paras Sethi
B.Tech, Computer Science and Engineering
HMR Institute of Technology and Management, New Delhi, India
E-mail: Parassethi234@gmail.com

Vaibhav Bhandari
B.Tech, Computer Science and Engineering
HMR Institute of Technology and Management, New Delhi, India
E-mail: Vaibhav.vb24@gmail.com

Bhavna Kohli
B.Tech, Electronics and Communication
HMR Institute of Technology and Management, New Delhi, India
E-mail: Bhav2411@gmail.com

ABSTRACT

The past few years have seen an increase in the number of spam
emails and messages. Legal, economic and technical measures can
be used to tackle spam SMS's nowadays. A key role in stopping
this problem is being played by Bayesian filters. In this paper,
we analyzed and studied the relative strengths of various
machine learning algorithms in order to detect spam messages
sent to mobile devices. We acquired the data from an open public
dataset and prepared two datasets for our testing and validation
purposes. Accuracy in detecting spam messages was the first
priority in ranking these algorithms. Our results clearly
demonstrate that different machine learning algorithms, under
different features, tend to perform differently in classifying
spam messages.

Keywords: Natural Language Processing (NLP), spam detection,
machine learning.

1. INTRODUCTION

Since the year 2001, mobile spam messages have been a consistent
problem in Far East countries. In fact, it was quite surprising
to note that the amount of spam messages exceeded the number of
spam emails. In some countries the SMS spam problem was so
disturbing that legal action was taken in order to stop the
practice. Two acts were filed by the Japanese government in the
year 2002 for the definition and penalization of email and
mobile abuse. These laws brought considerable relief to the
public regarding spam message abuse and eradicated some of the
problems in society. Experts also feel that technical as well as
legal measures need to be taken in order to control this
widespread, disturbing abuse of mobile spam messages. Since the
effort and expense of sending a spam message is higher than the
cost of sending spam emails, this practice is still considered a
minor problem in Western countries. But in Europe, where a user
sends an average of 10 texts a day and most of the population
owns a smartphone, SMS spam has become fashionable.

In fact, in countries like Russia, real users are being
simulated and emulated by botnets and zombie PC systems when
sending messages through free messaging services. So we can
clearly assume that there has been a drastic decrease in the
cost of SMS spam; in other words, mobile spam has a good return
on investment (ROI). There has been constant discussion on this
topic, and people have come up with concrete technical measures
to tackle this problem. Most of these measures and practices can
be used to deal with SMS spam messages. Bayesian filters have
been the most prominent and most widely accepted: they
discriminate and classify legitimate messages from the pile of
normal and SMS spam messages with a clever use of machine
learning algorithms. This paper is based upon testing the
results obtained with Bayesian filtering and comparing those
results with other machine learning algorithms in order to find
the best alternative for classifying SMS spam messages. We also
propose some feature sets that could be utilized by different
machine learning algorithms to classify those text messages, and
we present our findings, with the accuracy of the algorithm
being the deciding factor in choosing the best method.

2. TECHNICAL MEASURES AGAINST SPAM EMAIL

Activities like phishing, selling drugs and advertising
pornographic sites use, on average, millions of spam email
messages [1]. There has been a drastic, disturbing increase in
the growth of this problem in recent years, which has motivated
scientists all over the world to take a keen interest in it and
come up with good solutions.

2.1 Spam Email Filtering

The following are the best techniques used nowadays to reduce
the amount of spam messages.
White and black listing. The messages coming from people
that are considered spammers are blocked. This is made

possible by blacklisting those people. However, this does not
disrupt the normal flow of messages, as the other senders are
still whitelisted and their messages are processed normally by
the mobile device.

Address management. This approach rests upon a system which
contains automatic addresses that are produced by machines and
generated at particular time intervals [3].

Collaborative filtering. This type of algorithm is used in
recommendation systems: when one user marks an item as
"something", the other users get some information based upon the
tagging of the first user. The same idea can be applied here: if
one person marks a message as "spam", then other users can be
warned before opening that message [4].

Digital signatures. If there is no digital signature present in
the message, it is safe to assume that it is a spam message. The
service provider can be used to obtain the digital signatures of
the sender [5].

Content-based filtering [6]. Here, a careful approach is taken.
A well annotated dataset of spam messages is maintained, and
analysis is done by observing patterns in those messages.
Machine learning models are then built on top of those
annotated, classified messages.
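
As a small illustration of the white and black listing idea described above (this is not part of the paper's own pipeline), the check reduces to a lookup on the sender's address; the sender lists below are made-up examples.

```python
# Minimal sketch of white/black listing: block known spammers, pass trusted senders.
# The example sender lists are hypothetical.
BLACKLIST = {"+1900SPAMMER", "promo@bulk-sms.example"}
WHITELIST = {"+15551234567", "friend@family.example"}

def filter_by_sender(sender: str) -> str:
    """Return 'block', 'allow', or 'unknown' based only on the sender address."""
    if sender in BLACKLIST:
        return "block"      # message from a blacklisted sender is dropped
    if sender in WHITELIST:
        return "allow"      # whitelisted senders bypass further filtering
    return "unknown"        # fall through to content-based filtering

print(filter_by_sender("+1900SPAMMER"))   # block
print(filter_by_sender("+15551234567"))   # allow
```
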
3. EXPERIMENTS AND OBSERVATIONS

We performed tests and analysis with various machine learning
algorithms on different feature sets and, finally, we
represented our findings by ranking the different machine
learning algorithms in order of their accuracy, so as to
determine the machine learning algorithm that classifies SMS
spam messages with the highest accuracy.

3.1 Corpus

The database used for this research is currently publicly
available on Kaggle under the name "SMS spam collection
database". Kaggle is a website that acts as a transparent
repository for public datasets, and this dataset was provided to
the website from the UCI machine learning repository and other
public open datasets.

3.2 Database Description

Our database is a freely available dataset of 5,574 classified
short messages (SMS's) that are real and non-encoded. The
language of these messages is English and they are labelled as
ham or spam. First, we tokenized the text and obtained the IG
matrix. We also removed all the stop words from the text, since
stop words don't have any impact on the model and might
otherwise affect the modelling procedure.
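
A minimal sketch of the corpus loading and preprocessing described in Sections 3.1-3.2, assuming the Kaggle "SMS Spam Collection" CSV with columns v1 (label) and v2 (text) and scikit-learn's built-in English stop-word list; the file name and column names are assumptions rather than details given in the paper.

```python
# Sketch: load the SMS Spam Collection, tokenize, drop English stop words,
# and build a token-count matrix. File name and column names are assumed.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "text"]                       # ham/spam label and raw SMS text

vectorizer = CountVectorizer(stop_words="english")   # tokenization + stop-word removal
X_counts = vectorizer.fit_transform(df["text"])      # sparse document-term count matrix
y = (df["label"] == "spam").astype(int)              # 1 = spam, 0 = ham

print(X_counts.shape, y.mean())                      # matrix size and spam ratio
```
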

Fig.1 Message Length Feature
Fig.2 Maximum Occurring Token

3.3 Feature Selection

In a data mining inspired approach, we decided to feed the
algorithms two main features: the message length and the count
vectorizer matrix. We use the length of the message as an
attribute quality metric. During our exploratory analysis we
found that spam messages have different lengths compared to
non-spam messages. This is evident from the plot shown in fig.1,
where we can easily see that spam messages tend to have a mean
length of about 176 characters, while ham messages tend to have
a mean length of only about 55 characters. We also use
Information Gain (IG) [6, 7] (fig.2) as an attribute quality
metric. The experience in learning-based text classification is
that IG can substantially reduce the number of attributes
without loss of accuracy, and sometimes even with some
improvement.
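
The two features described above can be sketched as follows; this continues the objects (df, X_counts, vectorizer, y) from the hypothetical preprocessing snippet shown earlier, and uses mutual information as a stand-in for the information-gain ranking.

```python
# Sketch: message-length feature and an information-gain style ranking of tokens.
# Continues the hypothetical df / X_counts / vectorizer / y objects from the
# preprocessing sketch above; mutual information approximates information gain.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

df["length"] = df["text"].str.len()                  # message length feature
print(df.groupby("label")["length"].mean())          # mean length of ham vs. spam

ig_scores = mutual_info_classif(X_counts, y, discrete_features=True)
top_idx = np.argsort(ig_scores)[::-1][:10]           # ten most informative tokens
tokens = np.array(vectorizer.get_feature_names_out())
print(list(zip(tokens[top_idx], np.round(ig_scores[top_idx], 4))))
```
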

3.4 Machine Learning Algorithms

For our experiments we have used the following algorithms (a
short instantiation sketch follows the list):

• Naive Bayes (NB): This is the most immediate and simple
Bayesian learning method. It is based on Bayes' theorem: the
model learns its parameters by observing each feature
independently, regardless of all the other features in the data
set, and derives per-class statistics for each feature. Naive
Bayes classifiers provide generalization performance that is
only slightly worse than that of linear classifiers.

• Random Forests: A random forest is an ensemble of decision
trees grouped together, which is used in order to reduce the
problem of overfitting in decision trees. The main idea behind
this algorithm is that each tree is capable of producing
predictions that are slightly different from those of the other
trees. Because the trees produce different results, they perform
differently from each other, and in the end we generalize their
results by averaging them.

• Logistic regression: Logistic regression is one of the most
basic algorithms used for binary classification and is a linear
algorithm. The outcome here is a dichotomous variable (one that
can take two possible values). Instead of selecting parameters
that minimize the sum of squared errors (as in ordinary
regression), logistic regression chooses parameters that
maximize the likelihood of observing the sample values.
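
As referenced above, here is a minimal sketch of how the three classifiers can be instantiated with scikit-learn; the hyper-parameters shown are library defaults or illustrative values, not settings reported in the paper.

```python
# Sketch: the three classifiers compared in this paper, with illustrative settings.
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

models = {
    "Naive Bayes": MultinomialNB(),                              # count-based NB variant
    "Random Forest": RandomForestClassifier(n_estimators=100),   # ensemble of decision trees
    "Logistic Regression": LogisticRegression(max_iter=1000),    # linear binary classifier
}
```
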
3.5 Proposed Model

Fig. Detailed description of phases

1) Experimental Database: This step mostly comprised selecting
our primary dataset and applying data preprocessing techniques
that make the database more understandable and develop some
extra information about the dataset.

2) Feature Creation: This is the portion where we create some
extra meaningful information about the database which can be
used to help us classify the text as either spam or non-spam. We
were able to craft two features that could very easily
differentiate between spam and non-spam messages: the message
length and the information gain matrix.

3) Feature selection: Here, we fed all the above features as
well as the raw data to all the algorithms and asked each
algorithm to select the features that were best for its use.
Every algorithm was fed all the features, and the features
selected varied from one algorithm to another.

4) CV and Comparison of Algorithms: Here, a 5-fold cross
validation technique was used to prevent the algorithms from
overfitting and to obtain a precise estimate of the accuracy of
each algorithm. In the end we compared the different algorithms
according to this accuracy, as sketched below.
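
A minimal sketch of phases 3 and 4, assuming the models dictionary and the X_counts, df["length"] and y objects from the earlier hypothetical snippets; 5-fold cross-validated accuracy is used to rank the algorithms, as described above.

```python
# Sketch: 5-fold cross-validation comparison of the three classifiers on the
# combined feature set (token counts + message length). The models, X_counts,
# df and y objects come from the earlier hypothetical sketches.
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import cross_val_score

# Stack the sparse token counts with the message-length column.
X_all = hstack([X_counts, csr_matrix(df[["length"]].values)]).tocsr()

for name, model in models.items():
    scores = cross_val_score(model, X_all, y, cv=5, scoring="accuracy")
    print(f"{name}: mean 5-fold accuracy = {scores.mean():.4f}")
```
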

3.6 Experimental Results

We fed the following information to all three algorithms: the
raw data, the information gain matrix and the message length.
After parameter optimization using a 5-fold cross validation
technique, we obtained the results and accuracies shown in
fig.3. We gave each algorithm full freedom to pick its own
features: it could discard both engineered features and just
pick the raw text, or go with either of them, or go with both.
The algorithms did not all pick the same features: the naïve
Bayes algorithm performed best, by a small margin over its
variant with extra features, using just the information gain
matrix, while the other two algorithms went for all the
features.

Fig.3 Comparative study of different machine learning algorithms

4. CONCLUSION

From this set of experiments, we can say that naïve Bayes
outperforms the random forest and logistic regression algorithms
when it comes to classifying SMS spam. The naïve Bayes algorithm
achieved a high accuracy of 98.445% with just the information
gain matrix and easily classified the text as either spam or
non-spam.

Although we haven't provided a detailed analysis of the running
times of these algorithms, we observed that naïve Bayes also has
the lowest running time among them. We can also say that the
random forest algorithm using both features performed
excellently and could very well be a good alternative. We have
experimentally verified that the naïve Bayes algorithm
outperforms random forest and logistic regression.

REFERENCES

[1] Christine E. Drake, Jonathan J. Oliver, Eugene J. Koontz.
Anatomy of a Phishing Email. Proceedings of the First Conference
on Email and Anti-Spam (CEAS), 2004.

[2] Dwork, C., Goldberg, A., Naor, M. On memory-bound functions
for fighting spam. In Proceedings of the 23rd Annual
International Cryptology Conference (CRYPTO 2003), August 2003.

[3] R.J. Hall. How to avoid unwanted email. Communications of
the ACM, March 1998.

[4] Golbeck, J., Hendler, J. Reputation network analysis for
email filtering. In Proceedings of the First Conference on Email
and Anti-Spam (CEAS), 2004.

[5] Tompkins, T., Handley, D. Giving e-mail back to the users:
Using digital signatures to solve the spam problem. First
Monday, 8(9), September 2003.

[6] Yang, Y. 1999. An evaluation of statistical approaches to
text categorization. Information Retrieval, 1(1/2):69-90.

[7] Yang, Y., Pedersen, J.O. 1997. A comparative study on
feature selection in text categorization. In Proceedings of the
14th International Conference on Machine Learning.

