

GN GROUP OF INSTITUTES, GREATER NOIDA
Department of BCA
GREATER NOIDA INSTITUTE OF MANAGEMENT
(AFFILIATED TO CHAUDHARY CHARAN SINGH UNIVERSITY, MEERUT)

MAJOR PROJECT REPORT
ON
Email Spam Detection Model

(Submitted in partial fulfilment of the requirement for the Bachelors of Computer Application)
(SESSION: 2021-2024)

UNDER THE GUIDANCE OF:
Asst. Prof. Ms. Vandana

SUBMITTED BY:
SHANU CHAUHAN (211013106079)
SHIVAM KUMAR (211013106082)
RISHABH KUMAR DUBY (211013106069)


CERTIFICATE
This is to certify that the project work entitled “Email Spam Detection Model”
is a bona fide work carried out by Shanu Chauhan (211013106079) in partial
fulfilment of the requirements for the award of the degree of “Bachelors of
Computer Application”, 6th semester, of “Greater Noida Institute of Management”,
Knowledge Park 2 (U.P.).

The results embodied in this report have not been submitted to any other
university or institute for the award of any degree or diploma.

Name of Guide :
Prof. VANDANA

ACKNOWLEDGEMENT
It gives us a great sense of pleasure to present the report of the BCA project
undertaken during the BCA final year. We owe a special debt of gratitude to ………,
Department of Bachelor Computer Application, Greater Noida Institute of Technology,
Greater Noida, India for his constant support and guidance throughout the course of our
work. His/Her sincerity, thoroughness and perseverance have been a constant source of
inspiration for us. It is only because of his cognizant efforts that our endeavors have seen
the light of day.

We also take the opportunity to acknowledge the contribution of ......................................,
Head, Department of Bachelor Computer Application, Greater Noida Institute of
Technology, Greater Noida, India for his full support and assistance during the
development of the project.

We also do not like to miss the opportunity to acknowledge the contribution of all faculty
members of the department for their kind assistance and cooperation during the
development of our project. Last but not the least, we acknowledge our friends for their
contribution to the completion of the project.

Signature:

Name: Shanu Chauhan

Roll No.: 211013106079

Date : 05/06/2024

Signature:

Name : Shivam Kumar

Roll No.: 211013106082

Date : 05/06/2024

Signature:

Name : Rishabh Kumar Duby

Roll No.: 211013106069

Date:05/06/2024

ABSTRACT
Unsolicited emails, popularly referred to as spam, have remained one
of the biggest threats to cybersecurity globally. More than half of the
emails sent in 2021 were spam, resulting in huge financial losses. The
tenacity and perpetual presence of the adversary, the spammer, has
necessitated the need for improved efforts at filtering spam. This study,
therefore, developed baseline models of random forest and extreme
gradient boost (XGBoost) ensemble algorithms for the detection and
classification of spam emails using the Enron1 dataset. The developed
ensemble models were then optimized using the grid-search cross-
validation technique to search the hyperparameter space for optimal
hyperparameter values. The performance of the baseline (un-tuned)
and the tuned models of both algorithms was evaluated and compared.
The impact of hyperparameter tuning on both models was also
examined. The findings of the experimental study revealed that the
hyperparameter tuning improved the performance of both models when
compared with the baseline models. The tuned RF and XGBoost
models achieved an accuracy of 97.78% and 98.09%, a sensitivity of
98.44% and 98.84%, and an F1 score of 97.85% and 98.16%,
respectively. The XGBoost model outperformed the random forest
model. The developed XGBoost model is effective and efficient for spam
email detection.
Nowadays, communication plays a major role in everything, be it
professional or personal. Email is used extensively because it is free
or low-cost to use, accessible, and popular. Email has one major
security flaw: anyone can send an email to anyone else just by knowing
their unique user ID. This flaw is exploited by some businesses and
ill-motivated persons for advertising, phishing, malicious purposes, and
fraud. This produces a category of email called spam.

TABLE OF CONTENTS

CHAPTER No TITLE

Abstract
List of Figures
List of Tables
1 Introduction
2 Literature Review
2.1 Introduction
2.2 Related work
2.3 Summary
3 Objectives and Scope
3.1 Problem statement
3.2 Objectives
3.3 Project Scope
3.4 Limitations
4 Experimentation and Methods
4.1 Introduction
4.2 System architecture
4.3 Modules and Explanation
4.4 Requirements
4.5 Workflow
4.5.1 Data collection and Description
4.5.2 Data Processing
4.5.2.1 Overall Data Processing
4.5.2.2 Textual Data Processing
4.5.2.3 Feature Vector Processing
4.5.2.3.1 Bag of words
4.5.2.3.2 TF-IDF
4.5.3 Data Splitting
4.5.4 Machine Learning
4.5.4.1 Introduction
4.5.4.2 Algorithms
4.5.4.2.1 Naïve Bayes Classifier
4.5.4.2.2 Random Forest Classifier
4.5.4.2.3 Logistic Regression
4.5.4.2.4 K-Nearest Neighbors
4.5.4.2.5 Support Vector Machines
4.5.5 Experimentation
4.5.6 User Interface (UI)
4.5.7 Working Procedure
5 Results and Discussion
5.1 Language Model selection
5.2 Proposed Model
5.3 Comparison
5.4 Summary
6 Conclusion and Future Scope
6.1 Conclusion
6.2 Future Work
References
Appendices
A. Source code
B. Screenshots

List of Figures

Fig No Title

4.1 Architecture
4.2 Workflow
4.3 Enron Data
4.4 Ling spam
4.5 Naïve Bayes (BoW vs TF-IDF)
4.6 Logistic Regression (BoW vs TF-IDF)
4.7 Neighbors vs Accuracy (KNN)
4.8 KNN (BoW vs TF-IDF)
4.9 Random Forest (trees vs scores)
4.10 Random Forest (BoW vs TF-IDF)
4.11 SVM (BoW vs TF-IDF)
5.1 BoW vs TF-IDF (Cumulative)
5.2 Comparison of Models

List of Tables

Table number Table Name

4.1 Term Frequency
4.2 Inverse document frequency
4.3 TF-IDF
5.1 Models and results

CHAPTER 1
INTRODUCTION

Unsolicited emails, popularly referred to as spam, have remained one of the biggest
threats to cybersecurity globally. Between October 2020 and September 2021, a total
of 336.41 billion emails were sent globally, and about 84% (more than half) of these
emails were spam [1]. The huge financial loss resulting from email fraud is quite
enormous and increasing. According to the FBI center for crime complaint reports [2],
in 2021 about USD2.4 billion was lost as a result of scams associated with business
and email account compromises. In the same year, the bureau received 19,954 scam
email complaints. The IC3 data also showed that 3729 ransomware incidents were
reported with an associated financial loss of over USD49 million. According to the
spam and phishing report by Kaspersky on Securelist [3], between February and June
2022, 1.8 million 419 scam emails were detected. These statistics imply that
spammers are relentless. Researchers have continued to propose different techniques
to combat the spam menace [4,5,6,7,8,9,10]. However, the tenacity and perpetual
presence of the adversary, the spammer, has necessitated the need for improved
efforts at filtering spam. A spam-filtering model with improved accuracy will help in the
fight against spam-based fraud. Many current spam-email-detection techniques rely
on a single model, which can be prone to errors and overfitting [10,11,12,13,14].
Ensemble models, which combine the predictions of multiple models, have the
potential to improve the accuracy and robustness of spam detection. While ensemble
models have been widely used in other areas of machine learning, they have not been
widely applied to spam email detection. Hyperparameters, such as the number of
decision trees in a random forest or the regularization parameter in an extreme
gradient boost algorithm, can greatly affect the performance of a model. However,
finding the optimal hyperparameters is often ignored because it is a time-consuming
and computationally expensive task. Therefore, this study is aimed at the
hyperparameter optimization of the random forest (RF) and extreme gradient boosting
(XGBoost) ensemble algorithms. This is in a bid to enhance the predictive accuracy of
the two ensemble models and to determine the best-performing model, robust enough
for efficient spam email detection. Ensemble algorithms rely on a combination of
predictions from two or more base models to obtain an improved prediction
performance on a dataset [15]. In this study:
● Spam-email-detection models based on the random forest and XGBoost
machine-learning algorithms were developed.
● The performances of the ensemble models were optimized through hyperparameter
tuning.
● The performances of the ensemble models were evaluated and compared before and
after hyperparameter tuning.
● The convergence time of the models was also established.
The other sections of this study are organized as follows. The second section gives a brief
highlight of related research on spam email classification and detection.
The third section describes in detail the dataset and preprocessing techniques,
methods, and performance evaluation metrics. The results of the experiments are
presented in the fourth section. Finally, the last section draws a conclusion with a
perspective for further studies.

Today, spam has become a major problem in communication over the internet. It has
been reported that around 55% of all emails are spam, and the number
has been growing steadily. Spam, also known as unsolicited bulk email, has
grown with the increasing use of email, because email provides an ideal way to send
unwanted advertisements or junk newsgroup postings at no cost to the sender. This
opportunity has been extensively exploited by irresponsible organizations, cluttering
the mailboxes of millions of people all around the world.

Spam is a major concern: apart from the offensive content of some messages, it
wastes time, and end users risk deleting legitimate mail by mistake. Moreover,
spam has an economic impact, which has led some countries to adopt legislation against it.

CHAPTER 2
LITERATURE REVIEW

2.1 Introduction
This chapter reviews the machine learning classifiers used in previous research and
projects. It is not merely information gathering; it summarizes the prior research
related to this project. It involves searching, reading, analysing, summarising and
evaluating the reading materials relevant to the project.

A lot of research has been done on spam detection using machine learning. However,
because spam keeps evolving and new technologies keep being developed, the proposed
methods are not dependable. Natural language processing is one of the lesser-known
fields in machine learning, and this is reflected in the comparatively small amount of
work available.

2.2 Related work

In recent times, the machine-learning approach to spam email detection has
continued to increase in addition to other spam-filtering techniques such as
list-based (Whitelist, Greylist, and Real-time blacklist) and word-based
(Heuristic filters, word-based: DNS lookup), challenge-response, and so on.
Several studies have applied machine and deep learning with the intent of
improving the performance of spam filters for classifying emails. In one study
[16], a technique for detecting spam emails was introduced. This method
utilizes a decision-tree-mining approach and focuses on the email header
rather than the entire content of the email. An incremental learning algorithm
based on the C4.5 decision tree algorithm is also used to improve the
technique’s ability to adapt to changes in the structure of spam. The model
achieved a precision rate of 96%.
This study [17] distinguished between two types of spam emails: complete
spam, which is spam that is considered spam by all users, and semi-spam,
which is spam that is considered spam by some users but not by others. They
developed a method for identifying spam that combines Bayesian filtering for
complete spam with a crowdsourcing mechanism for identifying semi-spam.
The crowdsourcing aspect of the method involves soliciting reports of spam
from contacts or credible users with similar interests. The authors achieved
an accuracy rate of 95.1% and believed that their model’s performance could
be improved by applying the concept of virtual credits to stimulate self-
centered nodes to report spam and by enhancing connectivity and throughput.
This study [18] presented an email-filtering approach that utilizes semantic
methods, specifically the WordNet ontology, to classify emails as either spam
or non-spam. The approach aims to reduce the high dimensionality of email
text by applying semantic methods and similarity measures, and then further
reduces the number of features through the use of feature-selection
techniques such as Principal Component Analysis (PCA) and Correlation
Feature Selection (CFS). The proposed approach was tested on the Enron
dataset and was found to have a high accuracy rate of above 90% when using
the logistic regression classification algorithm, with a reduction in the number
of features by over 90%. The proposed method was also found to have a
higher accuracy rate and faster performance compared to other related
approaches.
The study by [19] combined Particle Swarm Optimization (PSO) with naïve
Bayes (NB) to create a new model for classifying emails as spam or non-
spam. The model was trained using 1000 emails from the Ling dataset, and
features were selected from the bag of words using Correlation-based Feature
Selection (CFS). The performance of the new NB and PSO model was compared
to that of the ordinary NB model, and it was found that the NB and PSO model had a greater
performance for all evaluation metrics (precision, recall, F-measure, and accuracy), with
values above 94%, while the ordinary NB model had values below 89% for all metrics.
This study [20] carried out a systematic review of several machine-learning
applications and their performances in spam detection. The current trends and
open research areas in spam filtering were discussed extensively. The
strength and weaknesses of the algorithms, such as the Bayesian
classification, random forest, ANNs, SVMs, deep learning, Artificial Immune
Systems, and Rough sets, amongst others, were compared. They verified that
significant progress has been made and more is required in the struggle to
end spamming. Finally, they recommended machine-, deep-, and deep-
adversarial-learning algorithms as possible future technology for the effective
management of spam emails.
This study [21] investigated various classification-based data-mining
techniques such as the J48 decision tree, random forest, naïve Bayes, and
SMO for identifying spam emails and analyzing their performance on a spam
dataset. WEKA was used to train and explore the performance of the different
classifiers and identify the best-performing one for classifying email spam. The
classifiers’ performance was evaluated based on the standard evaluation
metrics used for machine learning models and the training time. Random forest
outperformed the other models for all metrics evaluated and achieved a
weighted F-measure of 95.50%. Naïve Bayes also did well in terms of
execution time.
This study [22] evaluated the performance of five classifiers: logistic
regression, decision tree, naïve Bayes (NB), K-Nearest Neighbors (KNN), and
Support Vector Machine (SVM). They used the WEKA tool to train and test
the algorithms on the Spambase dataset from the UCI machine-learning
repository. The decision tree and KNN algorithms had the best performance,
with an accuracy rate of 99% for all metrics. However, KNN took longer to
converge than the other algorithms.
This study [14] presented a new approach to detecting spam emails that
combines the artificial bee colony algorithm with a logistic regression
classification model. The proposed method was tested on three publicly
available datasets, namely, Enron, CSDMC2010, and TurkishEmail. The
model achieved a higher classification accuracy (98.91%) than the other
methods considered. The study reported that the proposed method is effective
at handling high-dimensional data and performs better than other machine-
learning methods, including support vector machine, logistic regression, and
naïve Bayes, as well as state-of-the-art techniques from previous studies.
To detect spam accurately in mobile message communication, this study [23]
developed a machine-learning-based approach for detecting spam messages.
Three classifiers–logistic
regression (LR), K-Nearest Neighbor (K-NN), and decision tree (DT)–were
applied to the SMS spam collection dataset and evaluated on their ability to
classify ham and spam messages. The dataset was split into training and
testing sets. The results showed that LR had the highest classification
performance, with an accuracy of 99%, and outperformed K-NN and DT with
95% and 98% accuracy, respectively.
This study [13] integrated KNN with five bio-inspired algorithms to optimize
the spam email detection of the KNN model. The bio-inspired algorithms are
Grey wolf optimization, Firefly optimization, Chicken swarm optimization,
Grasshopper optimization, and Whale optimization. In the study, alongside the
evaluation of the performance of each of the algorithms with KNN, the
performance of distance measures such as Manhattan, Euclidean, and
Chebychev was measured. The findings revealed that the Whale optimization
algorithm integrated with the KNN model is quite promising for most of the
evaluation metrics.

Methodology

The Enron dataset was used in this research because it is the only substantial
public collection of actual emails and because of its high level of
usage among researchers. The Enron dataset is made up of 6 main directories;
each directory has several subdirectories, each containing emails as single text
files [28]. In this study, the Enron1 dataset was used. These emails were converted
to a single CSV file by Marcel Wiechmann [29]. The CSV file contained about
33,000 emails. However, during conversion, some of the email messages were not
correctly aligned with their labels. The non-aligning messages were removed
alongside the orphaned labels using Microsoft Excel. On completion of the
removal, the CSV file contained a total of 32,860 emails. Of these, 16,026
(49%) are legitimate emails (ham) and 16,834 (51%) are spam. The CSV file
contained five columns labeled message ID, subject, message, spam/ham, and date. The
subject and date columns were not used in this study. The needed columns were
readjusted as serial numbers. Column two contains the class label of each of the
emails and column three contains the text of each email.
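
The report states that misaligned rows were removed by hand in Microsoft Excel. As an illustration only, the same filtering could be sketched in Python with pandas. The inline CSV and the column names `Message`, `Spam/Ham` and so on are assumptions made for this example, not the actual file.

```python
import io

import pandas as pd

# Tiny stand-in for the converted CSV (hypothetical rows and column names).
RAW_CSV = """Message ID,Subject,Message,Spam/Ham,Date
0,meeting,see you at noon,ham,2001-05-14
1,win now,claim your free prize,spam,2004-07-02
2,broken row,text without a usable label,,2003-01-09
3,offer,cheap meds online,spam,2005-02-11
"""


def load_labelled_emails(csv_source):
    """Load the CSV and keep only rows whose label is exactly 'ham' or 'spam'."""
    df = pd.read_csv(csv_source)
    # Rows with a missing or unexpected label are treated as misaligned.
    df = df[df["Spam/Ham"].isin(["ham", "spam"])]
    # Keep only the class label and the email text, as described in the report.
    return df[["Spam/Ham", "Message"]].reset_index(drop=True)


emails = load_labelled_emails(io.StringIO(RAW_CSV))
class_share = emails["Spam/Ham"].value_counts(normalize=True)
print(len(emails), class_share.to_dict())
```

Filtering on the set of valid labels drops both empty labels and text that has spilled into the label column, which is one plausible reading of "non-aligning messages".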
Dataset Cleaning

To improve the quality of classification, it is important to get rid of unwanted
characters or features that constitute noise from the data. The cleaning activities and
functions used are presented in Table 1.
Table 1. Collection of functions used in data cleaning.

The noise constituent of the dataset such as non-ASCII characters, HTML tags,
extra white spaces, URLs, punctuations, numbers, and stop words were removed in
steps 1–7 (Table 1). Stop words are a collection of words in any language that occur
with a high frequency but convey considerably less meaningful information about the
significance of an expression. The removal of stop words and other noise constituents
shrinks the size of the data and reduces the burden of computational expenses in model
training, with the potential of improving model performance since there are only
meaningful words left to learn from.
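
The noise-removal steps described above can be sketched in Python as follows. This is a minimal illustration that uses only the standard library, not the functions listed in Table 1. The stop-word set is a tiny hypothetical sample, and the regular expressions are assumptions about what counts as a tag or a URL.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); a real run would use a full list.
STOP_WORDS = {"a", "an", "the", "of", "to", "is", "are", "and", "in", "for", "this"}

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_email(text):
    """Remove the noise constituents named in the text, in a fixed order."""
    text = text.encode("ascii", errors="ignore").decode("ascii")  # non-ASCII characters
    text = re.sub(r"<[^>]+>", " ", text)                          # HTML tags
    text = re.sub(r"(https?://|www\.)\S+", " ", text)             # URLs
    text = text.translate(PUNCT_TO_SPACE)                         # punctuation
    text = re.sub(r"\d+", " ", text)                              # numbers
    # split() also collapses extra white space; stop words are dropped here.
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)


print(clean_email("<p>Win £1000 now!</p> Visit http://spam.example.com for the prize"))
```

URLs are removed before punctuation so that the dots and slashes inside a link do not split it into stray words.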

The essence of stemming is also to reduce the data size by reducing words to their
root word. The text-mining map function, defined in the text-mining (tm) package [30],
received the parameters specified in steps 5–8 to execute each cleaning activity. The
DocumentTermMatrix() and removeSparseTerms() functions are also defined in the tm
package in R.
On completion of the data cleaning process, the data was divided into two sets: the training
set and the test set. The training set, which was composed of 70% (23,107) of the original
dataset, has 11,282 legitimate emails and 11,825 spam emails (Figure 1). The test set,
which was composed of 30% (9753) of the original dataset, has 4744 legitimate emails
and 5009 spam emails.

Figure 1. Distribution of ham and spam emails in train and test dataset.
Baseline models of random forest and extreme gradient boost (XGBoost) were
developed by training and testing each model independently with 70% and 30% of the
preprocessed dataset, respectively (Figure 2). All the parameters were set to their default
values during the training and testing of the baseline models. The performance of these
models on the test data was recorded as the baseline performance to be improved via
hyperparameter tuning. To reduce the computation time, only the important predictors
were passed to the random forest. The random forest has an inbuilt feature for ranking
variables or features based on their importance in arriving at a prediction [31].
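
A baseline with default parameters, followed by importance-based predictor selection, might look like the sketch below. It uses scikit-learn's `RandomForestClassifier` on synthetic data as a stand-in for the tools and email features of the study. The value `top_k = 10` and the data shape are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the document-term features (assumption).
X, y = make_classification(n_samples=400, n_features=30, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Baseline: every hyperparameter left at its default value.
baseline = RandomForestClassifier(random_state=0)
baseline.fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# Rank predictors by the forest's built-in importance and keep the top ones.
top_k = 10  # hypothetical cut-off
important = np.argsort(baseline.feature_importances_)[::-1][:top_k]

reduced = RandomForestClassifier(random_state=0)
reduced.fit(X_train[:, important], y_train)
reduced_accuracy = reduced.score(X_test[:, important], y_test)

print(f"baseline={baseline_accuracy:.3f} reduced={reduced_accuracy:.3f}")
```

Training on fewer columns shortens fitting time, which is the motivation given in the text. Whether accuracy is preserved depends on the data.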

Figure 2. The proposed model workflow.

3.3.1. Random Forest

Random forest is a supervised ensemble learning method that is used for classification
and regression. It creates a set of decision trees and combines them to make a final
prediction. To arrive at a prediction, the random forest follows these steps:
Step 1: Select a random sample of data from the dataset.
Step 2: Build a decision tree using the sample data.
Step 3: Repeat the process a certain number of times, creating a new decision tree
each time.
Step 4: Combine the decision trees by taking a majority vote of their predictions
(classification) or the average of their predictions (regression).
Each decision tree in the random forest makes a prediction, and the final prediction
aggregates the predictions made by all the individual decision trees.
This helps to reduce overfitting and improve the overall accuracy of the model. The two
important hyperparameters that must be defined by the user when generating a random
forest are mtry and ntree [32]. In a random forest, mtry is the number of features that are
randomly sampled as candidates for splitting at each decision tree node, and ntree
is the number of decision trees in the random forest [32]. The mtry parameter determines
how much randomness is injected into the model. A smaller mtry value injects more
randomness: the trees become less correlated with one another, which reduces the
variance of the forest, but each individual tree may be weaker. A larger mtry value makes
the trees more alike and individually stronger, but the higher correlation between trees
increases the risk of overfitting. The ntree parameter determines the size of the
ensemble. A larger value of ntree generally makes the predictions more stable and
accurate, but at the cost of increased computational resources and longer
training time. According to [33], the number of trees in a forest (ntree) is not limited by
computational resources, but the performance improvement from having a large number
of trees is minimal. However, [34] states that computational resources are the
limiting factor for the number of trees in a forest.
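
The grid-search cross-validation over ntree and mtry mentioned in this report can be illustrated with scikit-learn, where the corresponding parameters are `n_estimators` and `max_features`. This is a sketch on synthetic data. The grid values and the 3-fold setting are assumptions, not the values used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed email features (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# n_estimators plays the role of ntree, max_features the role of mtry.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": [2, 4, 8],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation for every grid point
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Every combination in the grid is trained once per fold, so the cost grows with the product of the grid sizes. This is why the text calls tuning time-consuming and computationally expensive.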

Modules and Explanation
The Application consists of three modules.
i. UI
ii. Machine Learning
iii. Data Processing
I. UI Module
a. This module contains all the functions related to the UI (user interface).
b. The user interface of this application is designed using the Streamlit library from
Python-based packages.
c. The user inputs are acquired using the functions of this library and forwarded to
the data processing module for processing and conversion.
d. Finally, the output from the ML module is sent to this module, and from this module to
the user in visual form.

II. Machine Learning Module


a. This module is the main module of all three modules.
b. This module performs everything related to machine learning and results analysis.
c. Some main functions of this module are
i. Training machine learning models.
ii. Testing the model
iii. Determining the respective parameter values for each model.
iv. Key-word extraction.
v. Final output calculation
d. The output from this module is forwarded to UI for providing visual response to
user

III. Data Processing Module


a. The raw data undergoes several modifications in this module for further processing.
b. Some of the main functions of this module include
i. Data cleaning
ii. Merging of datasets
iii. Text processing using NLP
iv. Conversion of text data into numerical data (feature vectors).
v. Splitting of data.
c. All the data processing is done using the Pandas and NumPy libraries.
d. Text processing and text conversion are done using the NLTK and scikit-learn libraries.

Requirements
Hardware Requirements
PC/Laptop
RAM – 8 GB
Storage – 100-200 MB
Software Requirements
OS – Windows 7 and above
Code Editor – PyCharm, VS Code, built-in IDE
Anaconda environment with packages nltk, numpy, pandas, sklearn, tkinter, nltk data.
Supported browser such as Chrome, Firefox, Opera, etc.

WorkFlow

fig no. 4.2 Workflow

In the above architecture, the objects depicted in green belong to a module called Data
Processing. It includes several functions related to data processing and natural language
processing. The objects depicted in blue belong to the Machine Learning module, where
everything related to ML is embedded. The red objects represent final results and outputs.

Data Description
Dataset : enronSpamSubset.
Source : Kaggle
Description : This dataset is part of a larger dataset
called Enron. It contains a set of spam and
non-spam emails, with 0 for non-spam and 1 for spam
in the label attribute.
Composition :
Unique values : 9687
Spam values : 5000
Non-spam values : 4687
fig no. 4.3 enron spam

Dataset : lingspam.
Source : Kaggle
Description : This dataset is part of a larger dataset called
Enron1 which contains emails classified as spam or
ham(not-spam).
Composition :
Unique values : 2591
Spam values : 419
Non-spam values : 2172

Data Processing
Overall data processing
It consists of two main tasks:
● Dataset cleaning
It includes tasks such as removal of outliers, removal of null values, and removal of
unwanted features from the data.
● Dataset merging
After data cleaning, the datasets are merged to form a single dataset containing
only two features (text, label).
Both data cleaning and data merging are done entirely using the Pandas library.
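
A minimal pandas sketch of the cleaning and merging step is given below. The two small frames and their column names (`Body`, `Label`, `Unnamed: 0`) are hypothetical stand-ins for the downloaded datasets. Outlier removal is not shown, because the report does not say how outliers were defined.

```python
import pandas as pd

# Hypothetical stand-ins for the two source datasets.
enron = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],  # unwanted index column
    "Body": ["cheap meds", None, "lunch at noon", "cheap meds"],
    "Label": [1, 1, 0, 1],
})
ling = pd.DataFrame({
    "Body": ["call for papers", "free money"],
    "Label": [0, 1],
})


def clean(df):
    """Keep only text and label, then drop null rows and exact duplicates."""
    df = df[["Body", "Label"]].rename(columns={"Body": "text", "Label": "label"})
    return df.dropna().drop_duplicates()


# Merge the cleaned datasets into a single frame with two features.
merged = pd.concat([clean(enron), clean(ling)], ignore_index=True)
print(merged)
```

Renaming the columns before concatenation guarantees that both sources line up under the same two feature names.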

Textual data processing
● Tag removal
Removing all kinds of tags and unknown characters from the text using regular
expressions through the regex library.
● Sentencing, tokenization
Breaking down the text (email/SMS) into sentences and then into
tokens (words).
This process is done using the NLTK pre-processing library of Python.
● Stop word removal
Stop words such as of, a, be, … are removed using the stopwords corpus of the NLTK
library of Python.
● Lemmatization
Words are converted into their base forms using lemmatization and
POS tagging.
This process gives keywords through entity extraction.
This process is done using chunking in regex and NLTK lemmatization.
● Sentence formation
The lemmatized tokens are combined to form a sentence.
This sentence is essentially the original sentence converted into its base form with
the stop words removed.
Then all the sentences are combined to form a text.
● While the overall data processing is applied only to the datasets, the textual
processing is applied to the training data, the testing data and the user input data.

Feature Vector Formation
● The texts are converted into feature vectors (numerical data) using the words
present in all the texts combined.
● This process is done using the CountVectorizer class of the scikit-learn library.
● The feature vectors can be formed using two language models: Bag of Words
and Term Frequency-Inverse Document Frequency.

Bag of Words
Bag of words is a language model used mainly in text classification. A bag of words
represents the text in a numerical form.
The two things required for Bag of Words are
• A vocabulary of words known to us.
• A way to measure the presence of words.
Ex: a few lines from the book “A Tale of Two Cities” by Charles Dickens.
“ It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness, ”
The unique words here (ignoring case and punctuation) are:
[ “it”, “was”, “the”, “best”, “of”, “times”, “worst”,“age”, “wisdom”, “foolishness” ]
The next step is scoring words present in every document.

After scoring, the four lines from the above stanza can be represented in vector form as
"it was the best of times" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

This is the main process behind the bag of words. In reality, however, the vocabulary from
even a couple of documents is very large, so only the words that repeat frequently and are
important in nature are kept, and the remaining words are removed during the text processing stage.
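
The four vectors above can be reproduced with scikit-learn's `CountVectorizer`. This is a sketch rather than the project's own code: the vocabulary is passed explicitly so that the column order matches the word list in the text, and `binary=True` records presence instead of counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

lines = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

# Fixed vocabulary in the same order as the unique-word list above.
vocabulary = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]

# binary=True marks presence (1) or absence (0) of each vocabulary word.
vectorizer = CountVectorizer(vocabulary=vocabulary, binary=True)
vectors = vectorizer.fit_transform(lines).toarray()

for line, vector in zip(lines, vectors):
    print(line, vector.tolist())
```

Without the explicit `vocabulary` argument, the vectorizer would sort the words alphabetically, so the columns would appear in a different order from the example.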

Term Frequency-inverse document frequency


Term frequency-inverse document frequency of a word is a measurement of the
importance of that word. It compares the repetition of the word within a document to its
occurrence across the collection of documents and calculates the score.

Terminology for the below formulae:
t – term(word)
d – document(set of words)
N – count of documents
The TF-IDF process consists of various activities listed below.

i) Term Frequency
The number of times a particular word appears in a document, divided by the total
number of words in that document, is called the term frequency.
tf(t, d) = (count of t in d) / (number of words in d)

ii) Document Frequency
Document frequency is the number of documents in which the word was detected. A
document is counted once, no matter how many times the word appears in it.
df(t) = number of documents containing t

iii) Inverse Document Frequency
• IDF is the inverse of document frequency.
• It measures the importance of a term t considering the information it contributes.
Term frequency alone treats every term as equally important, but certain terms such
as (are, if, a, be, that, ..) provide little information about the document. The inverse
document frequency factor reduces the importance of words/terms that have a higher
recurrence and increases the importance of words/terms that are rare.
idf(t) = N / df(t)

Finally, the TF-IDF can be calculated by combining the term frequency and inverse
document frequency. In the final score, the logarithm of the inverse document frequency
is taken to dampen its growth, and 1 is added to the denominator to avoid division by zero.

tf_idf(t, d) = tf(t, d) * log(N / (df(t) + 1))

The process can be explained using the following example:

Document 1: It is going to rain today.
Document 2: Today I am not going outside.
Document 3: I am going to watch the season premiere.

The Bag of words of the above sentences is
[going:3, to:2, today:2, i:2, am:2, it:1, is:1, rain:1]

Then finding the term frequency
table no. 4.1 Term frequency

Then finding the inverse document frequency
table no. 4.2 Inverse document frequency

Applying the final equation, the values of tf-idf become
table no. 4.3 TF-IDF
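
The three steps can be traced on the example documents with a short pure-Python sketch that follows the tf_idf(t, d) = tf(t, d) * log(N / (df(t) + 1)) formula given earlier. The natural logarithm and the simple letters-only tokenizer are assumptions, since the report does not state the log base or the tokenization rule.

```python
import math
import re

documents = [
    "It is going to rain today.",
    "Today I am not going outside.",
    "I am going to watch the season premiere.",
]


def tokenize(text):
    """Lower-case the text and keep runs of letters as tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)


def term_frequency(term, tokens):
    """Share of the document's words that are equal to `term`."""
    return tokens.count(term) / len(tokens)


def document_frequency(term):
    """Number of documents that contain `term` at least once."""
    return sum(1 for tokens in tokenized if term in tokens)


def tf_idf(term, tokens):
    """tf(t, d) * log(N / (df(t) + 1)), using the natural logarithm."""
    return term_frequency(term, tokens) * math.log(n_docs / (document_frequency(term) + 1))


for term in ["rain", "today", "going"]:
    print(term, [round(tf_idf(term, tokens), 4) for tokens in tokenized])
```

With only three documents, the +1 in the denominator has a visible effect: a word that appears in two documents, such as "today", scores exactly zero, and a word that appears in all three, such as "going", scores below zero.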

Using the above two language models the complete data has been converted into two
kinds of vectors and stored into a csv type file for easy access and minimal processing.

Data Splitting
Data splitting creates two kinds of data: training data and testing data.
Training data is used to train the machine learning models, and testing data is used to test
the models and analyse the results. 80% of the total data is selected as training data and the
remaining 20% is testing data.
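
An 80-20 split can be sketched with the standard library alone. The function name split_80_20, the seed value and the placeholder emails are assumptions made for illustration; the label encoding (1 = spam, 0 = non-spam) is likewise assumed.

```python
import random


def split_80_20(samples, labels, seed=10):
    """Shuffle the indices once, then cut them at 80% (illustrative helper)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * 0.8)
    train_idx, test_idx = indices[:cut], indices[cut:]
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


emails = [f"email {i}" for i in range(10)]  # placeholder data
labels = [i % 2 for i in range(10)]         # assumed encoding: 1 = spam, 0 = non-spam
train, test = split_80_20(emails, labels)
print(len(train), len(test))  # 8 2
```

scikit-learn's train_test_split does the same job; passing test_size=0.2 gives the 80-20 ratio.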

Machine Learning
Introduction
Machine learning is a process in which the computer learns to perform certain tasks without
being given explicit instructions. In this case the models take the training data and train on it.
Any new, unknown data is then processed based on the rules derived from the training data.

After completing the count vectorization and TF-IDF stages in the workflow, the data is
converted into vector form (numerical form), which is used for training and testing the models.

For our study various machine learning models are compared to determine which method is
more suitable for this task. The models used for the study include Logistic Regression, Naïve
Bayes, Random Forest Classifier, K Nearest Neighbors, and Support Vector Machine
Classifier and a proposed model which was created using an ensemble approach.

Algorithms
A combination of 5 algorithms is used for classification.

Naïve Bayes Classifier


A naïve Bayes classifier is a supervised probabilistic machine learning model that is used for
classification tasks. The main principle behind this model is the Bayes theorem.

Bayes Theorem:
Naive Bayes is a classification technique based on Bayes’ Theorem, with the assumption
that all the features that predict the target value are independent of each other. It
calculates the probability of each class and then picks the one with the highest probability.

Although the independence assumption rarely holds in real-world data, the classifier often
works well in practice. This assumption is why it is called “Naive” [14].

P(A|B) = (P(B|A) P(A)) / P(B)
P(A|B) is the probability of hypothesis A given the data B. This is called the posterior
probability.
P(B|A) is the probability of data B given that hypothesis A was true.
P(A) is the probability of hypothesis A being true (regardless of the data). This is called the
prior probability of A.
P(B) is the probability of the data (regardless of the hypothesis) [15].
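
A small numeric illustration of the theorem, taking A as "the email is spam" and B as "the email contains a given word". All probabilities below are hypothetical numbers chosen for the example, and the function name posterior is illustrative.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) by Bayes' theorem, with P(B) expanded over A and not-A."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence


# Hypothetical figures: 40% of mail is spam; the word "free" appears in
# 30% of spam emails and in 2% of non-spam emails.
p_spam_given_free = posterior(prior_a=0.4, p_b_given_a=0.3, p_b_given_not_a=0.02)
print(round(p_spam_given_free, 3))  # 0.909
```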

Naïve Bayes classifiers are mostly used for text classification. The limitation of the Naïve
Bayes model is that it treats every word in a text as independent and equal in importance,
but words are not equally important: articles and nouns, for example, do not carry the same
weight in language. Because of its classification efficiency, however, this model is used in
combination with other language processing techniques.

Random Forest Classifier

Random Forest classifier is a supervised ensemble algorithm. A random forest consists of
multiple random decision trees. Two types of randomness are built into the trees. First, each
tree is built on a random sample from the original data. Second, at each tree node, a subset
of features is randomly selected to generate the best split [16].

Decision Tree:
The decision tree is a classification algorithm based completely on features. The tree
repeatedly splits the data on a feature with the best information gain. This process continues
until the information gained remains constant. Then the unknown data is evaluated feature by
feature until categorized. Tree pruning techniques are used for improving accuracy and
reducing the overfitting of data.
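
Information gain can be computed as the drop in entropy after splitting on a feature. The sketch below uses the usual entropy-based definition; the toy labels and the two binary features (contains_offer, has_greeting) are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four emails and two binary features.
labels = ["spam", "spam", "non-spam", "non-spam"]
contains_offer = [1, 1, 0, 0]  # separates the classes perfectly
has_greeting = [1, 0, 1, 0]    # says nothing about the class

print(information_gain(contains_offer, labels))
print(information_gain(has_greeting, labels))
```

A tree built on this data would split on contains_offer first, since that feature gives the larger gain.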

Several decision trees are created on subsets of the data, and the result given by the majority
of trees is taken as the final result. The number of trees to be created is determined
based on accuracy and other metrics through iterative methods.

Logistic Regression

Logistic Regression is a “Supervised machine learning” algorithm that can be used to model
the probability of a certain class or event. It is used when the data is linearly separable and
the outcome is binary or dichotomous [17]. The probabilities are calculated using a sigmoid
function.

For example, let us take a problem where data has n features.


We need to fit a line for the given data and this line can be represented by the equation

z = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + ... + b_n x_n

Here z is the logarithm of the odds (log-odds).
Generally, odds are calculated as

odds = p(event occurring) / p(event not occurring)

Sigmoid Function:
The sigmoid used here is the logistic function, hence the name logistic regression.
The logarithm of odds is calculated and fed into the sigmoid function to get a continuous
probability ranging from 0 to 1.

The logarithm of odds can be calculated by

log(odds)=dot(features,coefficients)+intercept

and these log_odds are used in the sigmoid function to get probability.

h(z) = 1 / (1 + e^(-z))
The output of the sigmoid function is a real number in the range 0 to 1, which is used to
determine which class the sample belongs to. Generally, 0.5 is taken as the threshold: a value
below it is considered a NO, and 0.5 or higher is considered a YES. The threshold can
be adjusted based on the requirement.
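
The two steps, log-odds followed by the sigmoid, can be sketched as follows. The coefficients, intercept and feature values are hypothetical and only illustrate the calculation; they are not fitted values from this study.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, coefficients, intercept, threshold=0.5):
    """Return (probability, class) for one sample."""
    log_odds = sum(x * b for x, b in zip(features, coefficients)) + intercept
    probability = sigmoid(log_odds)
    return probability, ("YES" if probability >= threshold else "NO")


# Hypothetical model with two features.
probability, label = predict([2.0, 1.0], coefficients=[0.8, -0.3], intercept=-0.2)
print(round(probability, 3), label)
```

Raising the threshold (for example threshold=0.9) makes the classifier more conservative about answering YES.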

K-Nearest Neighbors

KNN is a supervised classification algorithm. All the data points are assumed to lie in an
n-dimensional space. The category of a new data point is then determined by the majority
category among its nearest neighbors.
Euclidean distance is used to determine the distance between points.

The distance between 2 points is calculated as

d = √((x2 - x1)^2 + (y2 - y1)^2)

The distances between the unknown point and all the other points are calculated. Depending
on the K provided, the K closest neighbors are determined. The category to which the majority
of these neighbors belong is selected as the category of the unknown data.
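
A minimal sketch of this procedure in plain Python. The 2-dimensional points and their labels are hypothetical, and the function names are illustrative.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(points, labels, unknown, k=3):
    """Label the unknown point with the majority category of its k nearest neighbors."""
    distances = sorted((euclidean(p, unknown), label) for p, label in zip(points, labels))
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical training points in a 2-dimensional feature space.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["non-spam", "non-spam", "non-spam", "spam", "spam", "spam"]

print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (6, 5)))
```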

If the data contains up to 3 features then the plot can be visualized. KNN is fairly slow at
prediction time compared to algorithms such as SVM, as it needs to compute the distance to
all points to find the closest neighbors of the given point.

Support Vector Machines(SVM)

It is a machine learning algorithm for classification. Decision boundaries are drawn between
the categories, and the category of a point is determined by the side of the boundary on
which it falls.

Support Vectors:

The training points closest to the boundaries are called support vectors. The number of
support vectors depends on the data rather than on the number of categories. Instead of
points, these are called vectors because they are assumed to start from the origin. The
distance between the boundary and the nearest support vectors on either side is called the
margin. We want our margin to be as wide as possible because it yields better results.

There are three types of kernels used by SVM to create boundaries.

Linear: used if the data is linearly separable.
Poly: used if the data is not linearly separable. It maps the data into a higher-dimensional
space using polynomial combinations of the features.
Radial: this is the default kernel used in SVM. It implicitly maps the data into an
infinite-dimensional space.

If the data is 2-dimensional then the boundaries are lines. If the data is 3-dimensional then
the boundaries are planes. If the data has more than 3 dimensions then the boundaries are
called hyperplanes.

An SVM mainly depends on the decision boundaries for predictions. It doesn’t compare the
data to all other data to get the prediction, so SVMs tend to be quick with predictions.
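
To illustrate why prediction is cheap, the sketch below classifies points using only the parameters of a linear boundary. It assumes the boundary has already been learned; the weights and bias are hypothetical, and training is not shown.

```python
def decision_value(weights, bias, point):
    """Signed position of the point relative to the boundary w.x + b = 0."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(weights, bias, point):
    """Pick the category from the side of the boundary the point falls on."""
    return "spam" if decision_value(weights, bias, point) >= 0 else "non-spam"


# Hypothetical learned boundary: x + y - 5 = 0
weights, bias = (1.0, 1.0), -5.0

print(classify(weights, bias, (4, 4)))
print(classify(weights, bias, (1, 2)))
```

No training point is consulted at prediction time, which is the contrast with KNN.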

Experimentation

The process runs in order: data collection and processing, natural language processing,
vectorization, and machine learning. The data is collected, cleaned, and then subjected
to the natural language processing techniques specified in section IV. Then the cleaned data is
converted into vectors using Bag of Words and TF-IDF methods which goes like...

The Data is split into Training data and Testing Data in an 80-20 split ratio. The training and
testing data is converted into Bag-of-Words vectors and TF-IDF vectors.

There are several metrics to evaluate the models but accuracy is considered for comparing
BoW and TF-IDF models. Accuracy is generally used to determine the efficiency of a model.

Accuracy:
“Accuracy is the number of correctly predicted data points out of all the data points”.
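
As a short illustration of this definition, with hypothetical labels (1 = spam, 0 = non-spam, an assumed encoding):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of data points whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 0, 1, 0]
print(accuracy(true_labels, predicted_labels) * 100)  # 80.0
```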

Naïve Bayes Classification algorithm:

Two models, one for BoW and one for TF-IDF, are created and trained using respective
training vectors and training labels. Then the respective testing vectors and labels are used
to get the score for the model.

fig no. 4.5 naïve Bayes

The scores for Bag-of-Words and TF-IDF are visualized.

The scores for the BoW and TF-IDF models using the naïve Bayes classifier are 98.04 and
96.05 respectively.

Logistic Regression:

Two models are created following the same procedure used for the naïve Bayes models and
then tested. The results obtained are visualized below.

fig no. 4.6 Logistic Regression (Bow vs TF-IDF)

The scores for BoW and TF-IDF models are 98.53 and 98.80 respectively.
K-Nearest Neighbors:

Similar to the above models, the models are created and trained using respective vectors and
labels. In addition to the data, the number of neighbors to be considered should also be
provided.
Using an iterative method, K = 3 (number of neighbors) provided the best results for the BoW
model and K = 9 provided the best results for the TF-IDF model.

Using the K values the scores for BOW and TF-IDF are visualized below.

fig no. 4.7 Neighbors vs Accuracy


Taking K = 3 and K = 9 for BoW and TF-IDF respectively, the scores are calculated and are
presented below.

fig no. 4.8 KNN (Bow vs TF-IDF)

Random Forest:

Similar to the previous algorithms, two models are created and trained using respective
training vectors and training labels. In addition, the number of trees to be used for the
forest has to be provided.

fig no. 4.9 Random Forest (trees vs score)

Using the iterative method, the best value for the number of trees is determined. From the
results, it is clear that 19 estimators provide the best score for both the BoW and TF-IDF
models. The number of trees and scores for both models are visualized.

The scores for BoW and TF-IDF models are visualized.

fig no. 4.10 Random Forest (BoW vs TF-IDF)

Support Vector Machines (SVM):
Finally, two SVM models, one for BoW and one for TF-IDF, are created and then
trained using respective training vectors and labels. They are then tested using the testing
vectors and labels.

fig no. 4.11 SVM(Bow vs TF_IDF)

The scores for BoW and TF-IDF models are 59.41 and 98.82 respectively.

Proposed Model:

In our proposed system we combine all the models and make them into one. It takes an
unknown point and feeds it into every model to get predictions. Then it takes these predictions,
finds the category which was predicted by the majority of the models, and finalizes it.
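
The voting step can be sketched on its own. The individual predictions below are hypothetical stand-ins for the outputs of the five trained models, and the dictionary keys are illustrative names.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the category predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions from the five models for one email.
votes = {
    "naive_bayes": "Spam",
    "logistic_regression": "Spam",
    "random_forest": "Non-Spam",
    "knn": "Non-Spam",
    "svm": "Spam",
}
print(majority_vote(list(votes.values())))  # Spam
```

With five models and two categories a tie cannot occur, so the vote always yields a single category.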

To determine which model is effective we used three metrics: Accuracy, Precision, and
F1 Score. In the earlier comparison we used only accuracy, because we were determining
not which model is best but which language model is best suited for classification.

User Interface(UI)
The user interface (UI) is an important component in this application. The user only
interacts with the interface.
The UI of this project has been constructed with the help of an open source library called
Streamlit. The complete information and API reference can be obtained from the Streamlit
documentation.

Working Procedure

The working procedure includes the internal working and the data flow of the application.
1. After running the application some procedures are automated:
   • Reading data from file
   • Cleaning the texts
   • Processing
   • Splitting the data
   • Initialising and training the models
2. The user just needs to provide some data to classify in the area provided.
3. The provided data undergoes several procedures after submission:
   • Textual processing
   • Feature vector conversion
   • Entity extraction
4. The created vectors are provided to the trained models to get predictions.
5. After getting predictions, the category predicted by the majority is selected.
6. The accuracies of that prediction are calculated.
7. The accuracies and the entities extracted in step 3 are provided to the user.
Every time the user gives something new, the procedure from step 2 is repeated.

3. Results and Discussion
Language Model Selection
While selecting the best language model, the data was converted into both types of
vectors and the models were then tested to determine the best language model for classifying
spam.
The results from individual models are presented in the experimentation section under
methodology. The results from the models are now compared.

fig no. 5.1 Bow vs TF-IDF (Cumulative)

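From the scores reported above, TF-IDF does better than BoW for logistic regression (98.80
against 98.53) and SVM (98.82 against 59.41), while naïve Bayes scored higher with BoW
(98.04 against 96.05). Hence TF-IDF has been selected as the primary language model for
textual data conversion in feature vector formation.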

Proposed Model results


To determine which model is effective we used three metrics: Accuracy, Precision, and
F1 Score.
The resulting values for the proposed model are:
Accuracy – 99.0
Precision – 98.5
F1 Score – 98.6
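
For reference, precision and F1 score can be computed from the counts of true positives, false positives and false negatives using their standard definitions. The labels below are hypothetical (1 = spam is an assumed encoding) and are not the study's data.

```python
def precision_recall_f1(true_labels, predicted_labels, positive=1):
    """Precision, recall and F1 score for the chosen positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


true_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels)
print(precision, recall, f1)  # 0.75 0.75 0.75
```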

Comparison
The results from the proposed model have been compared with all the models individually in
tabular form to illustrate the differences clearly.

Model                 Accuracy   Precision   F1 Score
Naïve Bayes           96.0       99.2        95.2
Logistic Regression   98.4       97.8        98.6
Random forest         96.8       96.4        96.3
KNN                   96.6       96.9        96.0
SVM                   98.8       97.8        98.6
Proposed model        99.0       98.5        98.6

Table no. 5.1 Models and results

The color RED indicates that the value is lower than the proposed model and GREEN
indicates equal or higher.

Here we can observe that our proposed model matches or outperforms almost every other
model in every metric. Only one model (naïve Bayes) has slightly higher precision than our
model, but it is considerably lagging in the other metrics.
The results are visually presented below for easier understanding and comparison.

4. Conclusion and Future Scope
Conclusion
This study evaluated and compared the performance of two ensemble models
based on the random forest and extreme gradient boost ensemble algorithms. Baseline
random forest and XGBoost spam detection models were developed based on the train/test
split technique using the default parameters. The grid-search technique with 10-fold cross-
validation was applied to search the hyperparameter space to determine the optimal
hyperparameter values that optimized the performance of the random forest and XGBoost
models. The performance of the baseline models was evaluated and compared with that of
the tuned random forest and XGBoost models to examine the impact of hyperparameter
tuning. The findings revealed that hyperparameter tuning improved the performance of the
random forest and XGBoost models. The results also showed that the tuned XGBoost model
outperformed the tuned random forest model for all metrics evaluated. The effectiveness of
these ensemble models in spam email detection and classification was demonstrated.
It will be interesting to compare the XGBoost model and deep-learning models for spam
detection in a future study in a bid to gain further insight into the development of efficient and
effective spam email detection systems.
It is important to note that the distribution of classes in the dataset used in this study is
complementarily balanced. The behavior of these models will be different with a significantly
imbalanced dataset. Future studies will look at the performance of these models on an
imbalanced dataset.

Future work

There are numerous applications of machine learning and natural language
processing, and when combined they can solve some of the most troubling
problems concerned with texts. This application can be scaled to take in text in
bulk so that classification can be done more effectively on some public sites.

Other contexts such as negative, phishing, malicious, etc. can be used to
train the model to filter things such as public comments on various social sites.
This application can be converted to an online type of machine learning system and
can be easily updated with the latest trends of spam and other mails so that the
system can adapt to new types of spam emails and texts.

References
Dixon, S. Global Average Daily Spam Volume 2021. Available
online: https://www.statista.com/statistics/1270424/daily-spam-volume-global/ (accessed on 18 July
2022).
FBI. Federal Bureau of Investigation: Internet Crime Report 2021. Available
online: https://www.ic3.gov/Media/PDF/AnnualReport/2021_IC3Report.pdf (accessed on 6 August
2022).
Securelist. Types of Text-Based Fraud. Available
online: https://securelist.com/mail-text-scam/106926/ (accessed on 4 August 2022).
Onova, C.U.; Omotehinwa, T.O. Development of a Machine Learning Model for Image-Based Email
Spam Detection. FUOYE J. Eng. Technol. 2021, 6, 336–340.
Bindu, V.; Thomas, C. Knowledge Base Representation of Emails Using Ontology for Spam
Filtering. Adv. Intell. Syst. Comput. 2021, 1133, 723–735.
Kaddoura, S.; Chandrasekaran, G.; Popescu, D.E.; Duraisamy, J.H. A Systematic Literature Review
on Spam Content Detection and Classification. PeerJ Comput. Sci. 2022, 8, e830.
Méndez, J.R.; Cotos-Yañez, T.R.; Ruano-Ordás, D. A New Semantic-Based Feature Selection
Method for Spam Filtering. Appl. Soft Comput. 2019, 76, 89–104.
Ahmed, N.; Amin, R.; Aldabbas, H.; Koundal, D.; Alouffi, B.; Shah, T. Machine Learning Techniques
for Spam Detection in Email and IoT Platforms: Analysis and Research Challenges. Secur. Commun.
Networks 2022, 2022, 1862888.
Hosseinalipour, A.; Ghanbarzadeh, R. A Novel Approach for Spam Detection Using Horse Herd
Optimization Algorithm. Neural Comput. Appl. 2022, 34, 13091–13105.
Ismail, S.S.I.; Mansour, R.F.; Abd El-Aziz, R.M.; Taloba, A.I. Efficient E-Mail Spam Detection Strategy
Using Genetic Decision Tree Processing with NLP Features. Comput. Intell. Neurosci. 2022, 2022,
7710005.
Ravi Kumar, G.; Murthuja, P.; Anjan Babu, G.; Nagamani, K. An Efficient Email Spam Detection
Utilizing Machine Learning Approaches. Proc. Lect. Notes Data Eng. Commun. Technol. 2022, 96,
141–151.
Kontsewaya, Y.; Antonov, E.; Artamonov, A. Evaluating the Effectiveness of Machine Learning
Methods for Spam Detection. Procedia Comput. Sci. 2021, 190, 479–486.
Batra, J.; Jain, R.; Tikkiwal, V.A.; Chakraborty, A. A Comprehensive Study of Spam Detection in
E-Mails Using Bio-Inspired Optimization Techniques. Int. J. Inf. Manag. Data Insights 2021, 1, 100006.
Dedeturk, B.K.; Akay, B. Spam Filtering Using a Logistic Regression Model Trained by an Artificial
Bee Colony Algorithm. Appl. Soft Comput. J. 2020, 91, 106229.
Sagi, O.; Rokach, L. Ensemble Learning: A Survey. Wiley Interdiscip. Rev. Data Min. Knowl.
Discov. 2018, 8, e1249.
Sheu, J.J.; Chu, K.T.; Li, N.F.; Lee, C.C. An Efficient Incremental Learning Mechanism for Tracking
Concept Drift in Spam Filtering. PLoS ONE 2017, 12, e0171518.
Liu, X.; Zou, P.; Zhang, W.; Zhou, J.; Dai, C.; Wang, F.; Zhang, X. CPSFS: A Credible Personalized
Spam Filtering Scheme by Crowdsourcing. Wirel. Commun. Mob. Comput. 2017, 2017, 1457870.
Bahgat, E.M.; Rady, S.; Gad, W.; Moawad, I.F. Efficient Email Classification Approach Based on
Semantic Methods. Ain Shams Eng. J. 2018, 9, 3259–3269.
Agarwal, K.; Kumar, T. Email Spam Detection Using Integrated Approach of Naïve Bayes and Particle
Swarm Optimization. In Proceedings of the 2nd International Conference on Intelligent Computing
and Control Systems, ICICCS 2018, Madurai, India, 14–15 June 2018; Institute of Electrical and
Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 685–690.
Dada, E.G.; Bassi, J.S.; Chiroma, H.; Abdulhamid, S.M.; Adetunmbi, A.O.; Ajibuwa, O.E. Machine
Learning for Email Spam Filtering: Review, Approaches and Open Research
Problems. Heliyon 2019, 5, e01802.

Saha, S.; DasGupta, S.; Das, S.K. Spam Mail Detection Using Data Mining: A Comparative
Analysis. Smart Innov. Syst. Technol. 2019, 104, 571–580.
Nandhini, S.; Marseline, D.J. Performance Evaluation of Machine Learning Algorithms for Email Spam
Detection. In Proceedings of the International Conference on Emerging Trends in Information
Technology and Engineering, ic-ETITE 2020, Vellore, India, 24–25 February 2020; Institute of
Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2020.
Guangjun, L.; Nazir, S.; Khan, H.U.; Haq, A.U. Spam Detection Approach for Secure Mobile Message
Communication Using Machine Learning Algorithms. Secur. Commun. Networks 2020, 2020,
8873639.
Jancy Sickory Daisy, S.; Rijuvana Begum, A. Smart Material to Build Mail Spam Filtering Technique
Using Naive Bayes and MRF Methodologies. Proc. Mater. Today 2021, 47, 446–452.
Xia, T.; Chen, X. A Weighted Feature Enhanced Hidden Markov Model for Spam SMS
Filtering. Neurocomputing 2021, 444, 48–58.
Şimşek, H.; Aydemir, E. Classification of Unwanted E-Mails (Spam) with Turkish Text by Different
Algorithms in Weka Program. J. Soft Comput. Artif. Intell. 2022, 3, 1–10.
Xia, T.; Chen, X. Category-Learning Attention Mechanism for Short Text
Filtering. Neurocomputing 2022, 510, 15–23.
ENRON. The Enron-Spam Datasets. Available
online: https://www2.aueb.gr/users/ion/data/enron-spam/ (accessed on 16 August 2022).
Wiechmann, M. GitHub—MWiechmann/Enron_spam_data: The Enron-Spam Dataset Preprocessed
in a Single, Clean Csv File. Available
online: https://github.com/MWiechmann/enron_spam_data (accessed on 17 August 2022).
Feinerer, I. Introduction to the Tm Package Text Mining in R. Available
online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf (accessed on 16 August 2022).
Kolog, E.A.; Balogun, O.S.; Adjei, R.O.; Devine, S.N.O.; Atsa’am, D.D.; Dada, O.A.; Omotehinwa,
T.O. Predictive Model for Early Detection of Mother’s Mode of Delivery with Feature Selection.
In Delivering Distinctive Value in Emerging Economies; Anning-Dorson, T., Boateng, S.L., Boateng,
R., Eds.; Productivity Press: New York, NY, USA, 2022; pp. 241–264. ISBN 9781003152217.
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
Oshiro, T.M.; Perez, P.S.; Baranauskas, J.A. How Many Trees in a Random Forest? Proc. Lect. Notes
Comput. Sci. 2012, 7376, 154–168.
Guan, H.; Li, J.; Chapman, M.; Deng, F.; Ji, Z.; Yang, X. Integration of Orthoimagery and Lidar Data
for Object-Based Urban Thematic Mapping Using Random Forests. Int. J. Remote Sens. 2013, 34,
5166–5186.
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA,
USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794.
Oyewola, D.O.; Dada, E.G.; Omotehinwa, T.O.; Emebo, O.; Oluwagbemi, O.O. Application of Deep
Learning Techniques and Bayesian Optimization with Tree Parzen Estimator in the Classification of
Supply Chain Pricing Datasets of Health Medications. Appl. Sci. 2022, 12, 10166.
Hoque, K.E.; Aljamaan, H. Impact of Hyperparameter Tuning on Machine Learning Models in Stock
Price Forecasting. IEEE Access 2021, 9, 163815–163830.
Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A Comparative Analysis of Gradient Boosting
Algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967.

A. Source code
1. Module – Data Processing
import re
from collections import defaultdict

import spacy
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# Map the first letter of a POS tag to a WordNet POS; anything else is a noun.
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

nlp = spacy.load('en_core_web_sm')


def process_sentence(sentence):
    nouns = list()
    base_words = list()
    final_words = list()
    # Tokens of the raw sentence, used below for noun extraction.
    words_2 = word_tokenize(sentence)
    # Remove punctuation and underscores before lemmatization.
    sentence = re.sub(r'[^ \w\s]', '', sentence)
    sentence = re.sub(r'_', ' ', sentence)
    words = word_tokenize(sentence)
    pos_tagged_words = pos_tag(words)

    for token, tag in pos_tagged_words:
        base_words.append(lemmatizer.lemmatize(token, tag_map[tag[0]]))
    for word in base_words:
        if word not in stop_words:
            final_words.append(word)
    sym = ' '
    sent = sym.join(final_words)
    pos_tagged_sent = pos_tag(words_2)
    for token, tag in pos_tagged_sent:
        if tag == 'NN' and len(token) > 1:
            nouns.append(token)
    return sent, nouns


def clean(email):
    email = email.lower()
    sentences = sent_tokenize(email)
    total_nouns = list()
    string = ""
    for sent in sentences:
        sentence, nouns = process_sentence(sent)
        string += " " + sentence
        total_nouns += nouns
    # Return the nouns of the whole email, not only those of the last sentence.
    return string, total_nouns


def ents(text):
    doc = nlp(text)
    expls = dict()
    if doc.ents:
        for ent in doc.ents:
            # Group the entity texts by their label, e.g. 'ORG' or 'PERSON'.
            expls.setdefault(ent.label_, []).append(ent.text)
        return expls
    else:
        return 'no'

2. Module – Machine Learning


from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
import pandas as pd


class model:
    def __init__(self):
        self.df = pd.read_csv('Cleaned_Data.csv')
        self.df['Email'] = self.df.Email.apply(lambda email: np.str_(email))
        self.Data = self.df.Email
        self.Labels = self.df.Label
        # NOTE: no test_size is passed, so scikit-learn's default split
        # (75% training, 25% testing) applies; pass test_size=0.2 for 80-20.
        (self.training_data, self.testing_data,
         self.training_labels, self.testing_labels) = train_test_split(
            self.Data, self.Labels, random_state=10)
        self.training_data_list = self.training_data.to_list()
        self.vectorizer = TfidfVectorizer()
        self.training_vectors = self.vectorizer.fit_transform(self.training_data_list)
        self.model_nb = MultinomialNB()
        self.model_svm = SVC(probability=True)
        self.model_lr = LogisticRegression()
        self.model_knn = KNeighborsClassifier(n_neighbors=9)
        self.model_rf = RandomForestClassifier(n_estimators=19)

        # Train every model on the same TF-IDF vectors.
        self.model_nb.fit(self.training_vectors, self.training_labels)
        self.model_lr.fit(self.training_vectors, self.training_labels)
        self.model_rf.fit(self.training_vectors, self.training_labels)
        self.model_knn.fit(self.training_vectors, self.training_labels)
        self.model_svm.fit(self.training_vectors, self.training_labels)

    def get_prediction(self, vector):
        pred_nb = self.model_nb.predict(vector)[0]
        pred_lr = self.model_lr.predict(vector)[0]
        pred_rf = self.model_rf.predict(vector)[0]
        pred_svm = self.model_svm.predict(vector)[0]
        pred_knn = self.model_knn.predict(vector)[0]
        preds = [pred_nb, pred_lr, pred_rf, pred_svm, pred_knn]
        # Majority vote: label 1 means spam, so 3 of 5 votes decide.
        spam_counts = preds.count(1)
        if spam_counts >= 3:
            return 'Spam'
        return 'Non-Spam'

    def get_probabilities(self, vector):
        # Each entry is [P(non-spam), P(spam)] in percent.
        prob_nb = self.model_nb.predict_proba(vector)[0] * 100
        prob_lr = self.model_lr.predict_proba(vector)[0] * 100
        prob_rf = self.model_rf.predict_proba(vector)[0] * 100
        prob_knn = self.model_knn.predict_proba(vector)[0] * 100
        prob_svm = self.model_svm.predict_proba(vector)[0] * 100
        return [prob_nb, prob_lr, prob_rf, prob_knn, prob_svm]

    def get_vector(self, text):
        return self.vectorizer.transform([text])

3. Module – User interface


import time
from ML import model
import streamlit as st
import spacy
from DP import *
import matplotlib.pyplot as plt
import seaborn as sns

# inputs[0] flags pasted text, inputs[1] flags an uploaded file.
inputs = [0, 1]


# st.cache is deprecated in newer Streamlit releases in favour of
# st.cache_data / st.cache_resource.
@st.cache()
def create_model():
    mode = model()
    return mode


col1, col2, col3, col4, col5 = st.columns(5)
with col3:
    st.title("Spade")
st.write('welcome to Spade...')
st.write('A Spam Detection algorithm based on Machine Learning and Natural Language Processing')
text = st.text_area('please provide email/text you wish to classify', height=400,
                    placeholder='type/paste more than 50 characters here')
file = st.file_uploader("please upload file with your text.. (only .txt format supported)")

if len(text) > 20:
    inputs[0] = 1
if file is None:
    inputs[1] = 0
if inputs.count(1) > 1:
    st.error('multiple inputs given please select only one option')
else:
    if inputs[0] == 1:
        e = text
        given_email = e
    if inputs[1] == 1:
        bytes_data = file.getvalue()
        # The uploaded file arrives as bytes; decode it (UTF-8 assumed).
        given_email = bytes_data.decode('utf-8')
predictions = []
probs = []
col1, col2, col3, col4, col5 = st.columns(5)
with col3:
    clean_button = st.button('Detect')
st.caption("In case of a warning it's probably related to caching of your browser")
st.caption("please hit the detect button again ... ")

if clean_button:
    if inputs.count(0) > 1:
        st.error('No input given please try after giving the input')
    else:
        with st.spinner('Please wait while the model is running ... '):
            mode = create_model()
            given_email, n = clean(given_email)
            vector = mode.get_vector(given_email)
            predictions.append(mode.get_prediction(vector))
            probs.append(mode.get_probabilities(vector))
        col1, col2, col3 = st.columns(3)
        with col2:
            st.header(f"{predictions[0]}")
        probs_pos = [i[1] for i in probs[0]]
        probs_neg = [i[0] for i in probs[0]]
        # Show each model's probability for the category that was selected.
        if predictions[0] == 'Spam':
            plot_values = probs_pos
        else:
            plot_values = probs_neg
        plot_values = [int(i) for i in plot_values]
        st.header('These are the results obtained from the models')
        col1, col2 = st.columns([2, 3])
        with col1:
            st.subheader('predicted Accuracies of models')
            with st.expander('Technical Details'):
                # Order matches the list returned by get_probabilities().
                st.write('Model-1 : Naive Bayes')
                st.write('Model-2 : Logistic Regression')
                st.write('Model-3 : Random Forest')
                st.write('Model-4 : K-Nearest Neighbors')
                st.write('Model-5 : Support Vector Machines')
        with col2:
            # One animated progress bar per model.
            for index, value in enumerate(plot_values, start=1):
                st.write(f'Model-{index}', value)
                bar = st.progress(0)
                for i in range(value):
                    time.sleep(0.01)
                    bar.progress(i)
        st.header('These are some insights from the given text.')
        entities = ents(text)
        col1, col2 = st.columns([2, 3])
        with col1:
            st.subheader('These are the named entities extracted from the text')
            st.write('please expand each category to view the entities')
            st.write('a small description has been included with entities for user understanding')
        with col2:
            if entities == 'no':
                st.subheader('No Named Entities found.')
            else:
                renames = {'CARDINAL': 'Numbers', 'TIME': 'Time',
                           'ORG': 'Companies/Organizations', 'GPE': 'Locations',
                           'PERSON': 'People', 'MONEY': 'Money',
                           'FAC': 'Facilities'}
                for i in renames.keys():
                    with st.expander(renames[i]):
                        st.caption(spacy.explain(i))
                        # A label may be absent from this text; fall back to an empty list.
                        values = list(set(entities.get(i, [])))
                        strin = ', '.join(values)
                        st.write(strin)
B. Screenshots
