Journal of Engineering Sciences Vol 14 Issue 08,2023

Email Spam Detection Using Machine Learning Algorithms

Ms. R. Jyosna Devi¹, Yammanuri Hemanand²

¹ Asst. Professor, Dept of MCA, Audisankara Institute of Technology (AUTONOMOUS), Gudur, AP, India.

² PG Scholar, Dept of MCA, Audisankara Institute of Technology (AUTONOMOUS), Gudur, AP, India.

ABSTRACT:

Email spam has become a major problem: with the rapid growth in the number of internet users, the volume of spam email is also increasing. People use spam for illegal and unethical conduct, phishing, and fraud. Spammers send malicious links through spam emails that can harm a system and gain access to it. Creating a fake profile and email account is easy for spammers; they pose as genuine people in their spam emails and target people who are not aware of these frauds. It is therefore necessary to identify fraudulent spam emails. This project identifies such spam using machine learning techniques. This paper discusses the machine learning algorithms, applies them to our data sets, and selects the algorithm with the best precision and accuracy for email spam detection.

1. INTRODUCTION

The Internet has become an inseparable part of human lives, with more than four and a half billion Internet users finding it convenient to use. Moreover, emails are considered a reliable form of communication by Internet users [1]. Over the decades, e-mail services have evolved into a powerful tool for the exchange of different kinds of information. The increased use of e-mail also entails more spam attacks on Internet users. Spam can be sent from anywhere on the planet by users with deceptive intentions who have access to the Internet. Spam emails are unsolicited and unwanted emails sent to recipients who do not want or need them. These spam emails have fake content, mostly with links for phishing attacks and other threats, and they are sent in bulk to a large number of recipients [2]. The intention behind them is to steal users' personal information and then use it against their will to gain materialistic benefits [3]. These emails either contain malicious content or have URLs that lead to malicious content. Such emails are also sometimes referred to as phishing emails.

Despite the advancement of spam filtering applications and services, there is no definitive way to distinguish between legitimate and

ISSN:0377-9254 jespublication.com Page 32


malicious emails because of the ever-changing content of such emails. Spam has been sent for over three or four decades now, and despite the availability of various antispam services, nonexpert end-users still get trapped in such pitfalls even today [4]. In e-mail managers, spam filters detect spam and forward it to a dedicated space, the spam folder, allowing the user to choose whether or not to access it. Spam filtering tools such as corporate e-mail systems, e-mail filtering gateways, contracted antispam services, and end-user training can deal with spam emails in English and other widely used languages [4]. However, they are ineffective at filtering spam emails in languages that have only recently been digitized, such as Urdu. The proposed study exploits existing artificial intelligence models to detect spam emails written in Urdu. This article describes how machine learning (ML) and deep learning (DL) models such as Support Vector Machine (SVM), Naive Bayes, Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM), a recurrent neural network, can be trained to detect Urdu spam emails. Moreover, as there is no dataset of Urdu spam emails, this article also explains the creation of one and the training of various machine learning models on it.

2. LITERATURE SURVEY

Before implementing a spam detection model using machine learning or deep learning for e-mail written in Urdu, existing studies were reviewed, regardless of the language in which the e-mail content was written. A comparative summary of the same is also explained in Table 1. The authors in [6] gathered 1463 tweets written in Roman Urdu and categorized 1038 of them as ham and 425 of them as spam. On that data, they used discriminative multinomial Naive Bayes techniques. They got 95.12% with DMNB Text and 95.42% with NB. The techniques were used with a numerical sequence of words that did not take into account domain or linguistic details. Linguistic techniques, such as those that take into account the contextual characteristics of important terms in Roman Urdu literature, are expected to improve classification results.

A backpropagation neural network (BPNN) was used in [7] to filter spam emails. The authors gathered 200 emails and labelled half of them as spam and half as ham. They used the K-Means clustering algorithm to preprocess the dataset. Using k-means clustering in preprocessing and a BPNN in the learning stage of the model, they got a maximum accuracy of 95.42%. The suggested work had the limitation that the models take a long time to train and test. The efficiency of hybrid feature selection in the classification of emails was evaluated in [8]. The authors collected 169 emails, 114 of which were spam and 55 of which were ham. Using hybrid feature selection, they were able to achieve the highest accuracy rate. They employed hybrid feature selection (TF-IDF) with a total of four reducts and achieved an accuracy of 84.8%. The work's shortcoming is that the Malay language dataset was not well suited to TF-IDF and rough set theory, and it did not provide the highest level of accuracy for the planned study. In [9], the authors implemented Naive Bayes and J48 (Decision Tree) algorithms in a machine learning-based hybrid bagging approach for spam e-mail detection. They gathered 1000 emails, half of which were spam and the other half ham. For the training of both models, the dataset was divided into two sections. They employed Naive Bayes, J48, and a hybrid bagging strategy to classify spam and ham emails, with J48 providing the highest accuracy, 93.6%.

The boosting strategy replaces the flawed classifier's learning characteristics with those of the base classifier, improving the overall system efficiency. The concept of boosting could be used for additional study in order to improve the system's outcomes. The authors in [1] used machine learning algorithms to detect spam emails. They compiled a dataset using online sources such as Kaggle and others. They collected 5573 emails and used that data to train seven machine learning models. The best result is 98.5% accuracy with Multinomial Naive Bayes; however, it has obvious limitations, such as its class-conditional independence assumption, which causes the system to misidentify some data items. On the other hand, ensemble approaches have been shown to be effective since they use many learners to predict categories.

In [10], the authors have also made significant contributions to the field of spam e-mail detection. They used Kaggle and the UCI machine learning repository to collect 5674 emails and label them as spam or ham. They estimated accuracy using six machine learning classifiers. They explored a variety of ML algorithms; however, it was discovered that the Ensemble Filter produces better outcomes, with an accuracy of 98.5%, which is higher than that of the other learners, as well as faster testing. The article's limitation is that testing was done on an e-mail sample without taking into account evolving trends in emails, which could impair a classifier's effectiveness. For the filtration of spam emails, an integrated Naive Bayes algorithm along with particle swarm optimization (PSO) is defined in [11]. The authors used NB to train and classify emails and PSO for swarm behavior property distribution. PSO is utilized to optimize the parameters of the NB technique, and Naive Bayes is employed as a separator between spam and ham emails based on keywords. They achieved a maximum accuracy of 96.42% with the integrated NB and PSO method. It would be better if the Naive Bayes approach were used in combination with ant colony or artificial bee colony optimization. The authors in [4] implemented four machine learning algorithms from the pool of algorithms. For classifying spam and ham e-mail in Urdu, they chose Naive Bayes, SVM, KNN, and RF. They generated their own dataset of Urdu emails but did not provide any information about it. Using NB, they attained the highest accuracy, 89%. The study's drawbacks include that these machine learning techniques are only successful for limited, labelled datasets. These algorithms take a long time to train, and the results they provide are also mediocre.

3. EXISTING SYSTEM

Spam has become a serious nuisance on the internet: it wastes storage and time and slows message delivery. Automatic email filtering may be the most effective method of detecting spam, but nowadays spammers can easily bypass these filtering applications. Several years ago, most spam could be blocked manually because it came from certain email addresses. A machine learning approach will be used for spam detection. The major approaches adopted for spam filtering include “text analysis, white and blacklists of domain names, and community-based techniques”. Text analysis of the contents of emails is a widely used method for filtering spam.

However, rejecting emails purely on the basis of content analysis can be problematic in the case of false positives. Typically, users and organizations do not want any legitimate messages to be lost.

4. PROPOSED SYSTEM

Machine learning approaches are more efficient: a set of training data is used, consisting of emails that have been pre-classified. Machine learning offers many algorithms that can be used for email filtering. These algorithms include “Naïve Bayes, support vector machines, Neural Networks, K-nearest neighbour, Random Forests etc.”

Advantages

• “Filtering of spams can be done on the basis of the trusted and verified domain names.”
• “The spam email classification is very significant in categorizing e-mails and to distinct e-mails that are spam or non-spam.”
• “This method can be used by the big body to differentiate decent mails that are only the emails they wish to obtain.”

ARCHITECTURE DIAGRAM

[Figure: architecture diagram]

1. Dataset used

Our experiment combines two datasets to evaluate the performance of our model. The Liar dataset, the result of data collection by William Wang, has 6 output labels and is one of the largest datasets for fake news detection, collected from various sources such as speeches, campaigns, radio, and TV. It contains 12836 short statements with various information about the author. Each row of the dataset consists of a short news statement with the author, the output label, and other columns. For our experiment, we only require the news, author, and label columns. Our second dataset is a fake news dataset put together from four or five other datasets available on Kaggle to check news veracity. This dataset contains 20,801 news reports related to the United States. Including the full news article further enhances its utility. It consists of 4 columns providing information about the news, author, bias, and label. The datasets contain a balanced amount of fake and real news from various fields and are not restricted to any particular sector.

2. Data preprocessing

To use the datasets, we needed to preprocess and clean them according to our requirements so that we could apply the algorithms. We used the Pandas, Scikit-Learn, and NumPy libraries for this. We started the preprocessing by merging the 2 datasets on their matching columns using the Pandas library. After the two datasets are combined, the output data in CSV format has 3 columns: the news text containing the concise news portion, its author, and the output label. The Liar dataset labels are ‘True’, ‘Mostly True’, ‘Half True’, ‘Barely true’, ‘False’, and ‘pants fire’, whereas the Fake News dataset has the labels 0 (reliable) and 1 (unreliable). So the Liar dataset labels are first changed to binary format by setting a threshold for which labels count as true or false: 0 for True, Mostly True, and Half True, and 1 for Barely true, False, and pants on fire, using a Scikit-Learn transformation function. The next step was to clean the textual information. We removed the empty cells using Pandas' row-dropping function and used Scikit-Learn's library to remove hyphens and punctuation from the author's name as well as the news text. The dataset was then divided into 3 different datasets using Scikit-Learn's functions, with 60% for training, 20% for testing, and the remaining 20% for validation. The training dataset contains 15408 and the test set contains 5136 news articles.
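The label binarization and the 60/20/20 split described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' Pandas/Scikit-Learn code: the function names `binarize_liar_label` and `split_60_20_20`, the seed, and the toy rows are assumptions introduced here for the example.

```python
import random

# Six-way Liar labels grouped the way the paper describes:
# 0 = reliable, 1 = unreliable (matching the Fake News dataset's 0/1 labels).
RELIABLE = {"true", "mostly true", "half true"}
UNRELIABLE = {"barely true", "false", "pants fire", "pants on fire"}


def binarize_liar_label(label):
    """Map a six-way Liar label to 0 (reliable) or 1 (unreliable)."""
    key = label.strip().lower().replace("-", " ")
    if key in RELIABLE:
        return 0
    if key in UNRELIABLE:
        return 1
    raise ValueError(f"unknown label: {label!r}")


def split_60_20_20(rows, seed=0):
    """Shuffle the rows, then cut them into train, test, and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_train = len(shuffled) * 60 // 100
    n_test = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains (about 20%)
    return train, test, validation


# Hypothetical toy rows in the merged three-column layout: (news, author, label).
labels = ["True", "Mostly True", "Half True", "Barely true", "False", "pants fire"]
toy_rows = [(f"statement {i}", f"author {i % 3}", labels[i % 6]) for i in range(10)]

binary_rows = [(news, author, binarize_liar_label(label))
               for news, author, label in toy_rows]
train, test, validation = split_60_20_20(binary_rows)
print(len(train), len(test), len(validation))
```

Shuffling before cutting keeps the class mix roughly even across the three parts; a stratified split would enforce that exactly, which this sketch does not attempt.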


The dataset has been randomized before this split so that the data is evenly distributed. The training dataset has a distribution of 58% true and 42% false. The data is then processed again to remove white space and tokenized to remove unwarranted information from the dataset.

3. Feature Extraction

Our proposed approach aims to analyze the features extracted from the news and correlate them with its speaker or author. The associated author is then also given a credibility score according to the amount of fake and real news that he or she publishes. A high credibility score implies that the author's published work is known to be true and is heavily trusted. Our model approaches this problem in two ways: first, by providing lexicographic feature extraction of the news and, second, by providing a credibility score to the author. After this, a network is created that links the particular highly deceptive word list to each source and forms a vector matrix. In our experiment, we have performed feature extraction and selection using methods from the scikit-learn Python libraries. For feature selection, we have used methods like bag-of-words and n-grams, followed by term frequency weighting such as TF-IDF (Term Frequency-Inverse Document Frequency). We have also used word2vec and POS tagging to extract the features. A little light is thrown on each of the feature extraction models used:

• Bag-of-words: It is a fundamental way to represent text data and extract features from the text. In this, we tokenize the words of each observation and find their frequencies.

• N-grams: An n-gram is any contiguous sequence of n tokens or words. A bag of n-grams can be more informative than a bag of words because it captures more context around each word.

• TF-IDF: It is an information retrieval technique in which a token's value increases if the token occurs frequently in the document and decreases if it occurs frequently in the corpus, thus giving an accurate metric value for that token. It gives a word frequency score that tries to highlight the more interesting words.

Linguistic features are especially useful for identifying deceptive information, because such text is written to obscure the real information by highlighting or using certain non-descriptive and inaccurate words to attract people to clickbait or fake news [9]. The analysis of the bag of words of articles helps to provide a credibility score. In our experiment, this analysis is done on the subject and the brief textual information of the news. The linguistic features provide a set of words that are typically used in spam messages. These spam messages are usually fake news that is present on social media. After analysis and extraction, we form a vector matrix of these words. The statements and authors that frequently use such ambiguous and deceptive wording are assigned a low credibility score.

5. CONCLUSION

With the increased use of email, this study focuses on automated ways to detect spam emails written in Urdu. The study uses various machine learning and deep learning algorithms to detect them. In the study, a translated email dataset including spam and ham emails is generated from Kaggle and pre-processed for the various approaches. Accuracy, precision, recall, F-measure, ROC-AUC, and model loss are used as comparative measures to examine performance. The study concludes that deep learning models are more successful in classifying Urdu spam emails. Comparatively, the LSTM algorithm has a high accuracy rate of around 98% with a low model loss rate of 5%. Even though LSTM takes a little longer to train than CNN, SVM, or Naive Bayes, its efficiency and accuracy rate are far better than those of the other approaches. The creation of an actual dataset of Urdu emails can be considered a viable future task. In addition, more recent artificial intelligence approaches may also be considered to detect spam.

6. REFERENCES

[1] N. Kumar, S. Sonowal, and Nishant, “Email spam detection using machine learning algorithms,” in Proceedings of the 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), pp. 108–113, IEEE, Coimbatore, India, July 2020.
[2] G. Jain, M. Sharma, and B. Agarwal, “Optimizing semantic LSTM for spam detection,” International Journal of Information Technology, vol. 11, no. 2, pp. 239–250, 2019.
[3] F. Masood, G. Ammad, A. Almogren et al., “Spammer detection and fake user identification on social networks,” IEEE Access, vol. 7, pp. 68140–68152, 2019.
[4] A. Akhtar, G. R. Tahir, and K. Shakeel, “A mechanism to detect Urdu spam emails,” in Proceedings of the 2017 IEEE 8th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference (UEMCON), pp. 168–172, IEEE, New York, NY, USA, Oct 2017.
[5] H. Drucker, D. Donghui Wu, and V. N. Vapnik, “Support vector machines for spam categorization,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1048–1054, 1999.
[6] H. Afzal and K. Mehmood, “Spam filtering of bi-lingual tweets using machine learning,” in Proceedings of the 2016 18th International Conference on Advanced Communication Technology (ICACT), pp. 710–714, IEEE, PyeongChang, Korea (South), Feb 2016.
[7] S. K. Tuteja and N. Bogiri, “Email spam filtering using BPNN classification algorithm,” in Proceedings of the 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT), pp. 915–919, IEEE, Pune, India, Sep 2016.
[8] M. Mohamad and A. Selamat, “An evaluation on the efficiency of hybrid feature selection in spam email classification,” in Proceedings of the 2015 International Conference on Computer, Communications, and Control Technology (I4CT), pp. 227–231, IEEE, Kuching, Malaysia, Apr 2015.
[9] P. Sharma, U. Bhardwaj, and U. Bhardwaj, “Machine learning based spam e-mail detection,” International Journal of Intelligent Engineering and Systems, vol. 11, no. 3, pp. 1–10, 2018.
[10] S. Suryawanshi, A. Goswami, and P. Patil, “Email spam detection: an empirical comparative study of different ml and ensemble classifiers,” in Proceedings of the 2019 IEEE 9th International Conference on Advanced Computing (IACC), pp. 69–74, IEEE, Tiruchirappalli, India, Dec 2019.
[11] K. Agarwal and T. Kumar, “Email spam detection using integrated approach of naïve Bayes and particle swarm optimization,” in Proceedings of the 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), pp. 685–690, IEEE, Madurai, India, June 2018.
[12] A. Iyengar, G. Kalpana, S. Kalyankumar, and S. Guna Nandhini, “Integrated spam detection for multilingual emails,” in Proceedings of the 2017 International Conference on Information Communication and Embedded Systems (ICICES), pp. 1–4, IEEE, Chennai, India, February 2017.
[13] K. Kandasamy and P. Koroth, “An integrated approach to spam classification on twitter using url analysis, natural language processing and machine learning techniques,” in Proceedings of the 2014 IEEE Students’ Conference on Electrical, Electronics and Computer Science, pp. 1–5, IEEE, Bhopal, India, March 2014.
[14] X.-l. Chen, P.-y. Liu, Z.-f. Zhu, and Y. Qiu, “A method of spam filtering based on weighted support vector machines,” in Proceedings of the 2009 IEEE International Symposium on IT in Medicine & Education, vol. 1, pp. 947–950, IEEE, Jinan, China, Aug 2009.

Authors Profile:

Ms. R. Jyosna Devi completed her Master of Computer Applications at KSIT. She has been dedicated to the teaching field for the last 5 years. She is currently working as an Asst. Professor in the Department of MCA at Audisankara Institute of Technology (AUTONOMOUS), Gudur, Tirupathi (DT).

Yammanuri Hemanand is pursuing his MCA (2023) at Audisankara Institute of Technology (AUTONOMOUS), Gudur, Andhra Pradesh, India, affiliated to JNTUA.
