AI Phase4
Introduction:
Spam emails have become a growing problem for web users everywhere. These unsolicited
messages needlessly waste network resources. Machine learning techniques are commonly
adopted for filtering spam emails. This article examines the capabilities of the extreme
learning machine (ELM) and the support vector machine (SVM) for classifying spam emails.
ELM is an efficient model based on a single-hidden-layer feedforward neural network in
which the hidden-layer weights are chosen randomly. SVM is a strong classifier, grounded in
statistical learning theory, that is used frequently for classification. The performance of ELM
is compared with that of SVM in terms of accuracy, precision, recall, false positive rate and
true positive rate. Moreover, a sensitivity analysis is performed with both ELM and SVM for
spam email classification.
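As a point of reference for the ELM description above, the following is a minimal sketch of an extreme learning machine written for illustration only: the class name, the tanh activation, the hidden-layer size, and the 0.5 decision threshold are assumptions, not details taken from this report.

import numpy as np

class SimpleELM:
    """Single-hidden-layer feedforward network with random, untrained hidden weights."""
    def __init__(self, n_hidden=100, random_state=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(random_state)

    def fit(self, X, y):
        # Hidden-layer weights and biases are drawn randomly and never updated
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)
        # Output weights are solved in closed form with the Moore-Penrose pseudo-inverse
        self.beta = np.linalg.pinv(H) @ y
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return (H @ self.beta > 0.5).astype(int)  # threshold for binary (0/1) labels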
Process:
Email spam detection is a common problem in machine learning: emails are classified into
spam and non-spam categories using supervised learning algorithms. One such algorithm is
the Support Vector Machine (SVM); a minimal end-to-end sketch follows the steps below.
1. Data Collection: Collect a dataset of emails that are labeled as spam or non-spam.
2. Data Preprocessing: Preprocess the collected data by removing stop words,
stemming, and tokenizing the text.
3. Feature Extraction: Extract features from the preprocessed data using techniques like
Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).
4. Model Training: Train a machine learning model on the extracted features using
algorithms like SVM, Naive Bayes, or Random Forest.
5. Model Evaluation: Evaluate the performance of the trained model using metrics like
accuracy, precision, recall, and F1-score.
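As a quick illustration, here is a minimal, self-contained sketch of steps 1-5 with a linear SVM classifier. The file name spam.csv, the v1/v2 column renaming, and the split parameters are assumptions based on the dataset linked below, not code from this report.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Steps 1-2: load the labelled messages and keep only the label/text columns
df = pd.read_csv("spam.csv", encoding="ISO-8859-1")[["v1", "v2"]]
df.columns = ["Category", "Message"]

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    df["Message"], df["Category"], test_size=0.2, random_state=42)

# Steps 3-4: TF-IDF features feeding a linear SVM
spam_clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", lowercase=True)),
    ("svm", LinearSVC()),
])
spam_clf.fit(X_train, y_train)

# Step 5: evaluate with accuracy, precision, recall and F1-score
print(classification_report(y_test, spam_clf.predict(X_test)))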
Dataset Link: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
Data Collection:
Gather a large and diverse dataset of emails or messages, labeled as spam or non-spam
(ham). This dataset is crucial for training your AI model.
Data Preprocessing:
Clean and preprocess the data. This may involve tasks like removing special characters,
stemming, and tokenization.
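As an illustration of these cleaning steps, the sketch below uses NLTK for stop word removal and stemming; the clean_text helper, its regular expression, and the example message are assumptions made for this example.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_text(text):
    # Remove special characters and digits, keeping only letters and spaces
    text = re.sub(r'[^a-zA-Z\s]', ' ', text.lower())
    # Simple whitespace tokenization
    tokens = text.split()
    # Drop stop words and stem what remains
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

print(clean_text("WINNER!! You have won a $1000 prize, claim it now!!!"))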
Feature Extraction: Extract relevant features from the preprocessed data. Common
techniques include Bag-of-Words, TF-IDF (Term Frequency-Inverse Document Frequency), or
word embeddings like Word2Vec.
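The short sketch below contrasts Bag-of-Words counts with TF-IDF weights on two made-up messages; the toy texts are purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["free prize waiting for you", "team meeting moved to friday"]  # toy messages

bow = CountVectorizer()    # Bag-of-Words: raw term counts
tfidf = TfidfVectorizer()  # TF-IDF: counts reweighted by inverse document frequency

print(bow.fit_transform(docs).toarray())
print(tfidf.fit_transform(docs).toarray().round(2))
print(bow.get_feature_names_out())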
Training the Model: Train your chosen model using the preprocessed data. Use a portion
of the dataset for training and another portion for validation to fine-tune the model.
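A hedged sketch of such a split is shown below on synthetic features; the 80/10/10 proportions and the stand-in data are assumptions, not values from this report.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Toy feature matrix and labels standing in for the extracted text features and spam/ham labels
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# Hold out a test set first, then split the remainder into training and validation parts
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=1/9, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))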
Evaluation:
Evaluate the model’s performance using metrics like accuracy, precision, recall, and F1-score.
This helps in understanding how well your model is performing.
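For reference, the snippet below computes these metrics with scikit-learn on a toy set of labels (1 = ham, 0 = spam, matching the encoding used later in the program); the label vectors are made up for illustration.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels (toy example)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions (toy example)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))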
Fine-Tuning:
Based on the evaluation results, fine-tune the model. This could involve adjusting
hyperparameters, trying different algorithms, or exploring advanced techniques like
ensemble methods.
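One common way to do this tuning is a grid search over hyperparameters; the sketch below shows the idea with scikit-learn's GridSearchCV. The parameter grid, the tiny toy messages, and the 2-fold cross-validation are illustrative assumptions, not settings from this report.

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])

# Candidate hyperparameters to try (illustrative values)
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "svm__C": [0.1, 1, 10],
}

# Tiny toy dataset so the example runs end to end; 0 = spam, 1 = ham
messages = ["win a free prize now", "are we meeting tomorrow", "free cash claim now",
            "lunch at noon?", "urgent prize waiting", "see you at the office",
            "claim your free reward", "project update attached",
            "winner winner free entry", "call me when you are free"]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(messages, labels)
print(search.best_params_, search.best_score_)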
Python program:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score
import nltk
from nltk.corpus import stopwords
from collections import Counter

# Load the SMS Spam Collection dataset (Kaggle path)
df = pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv", encoding='ISO-8859-1')
df
Output:
df.info()
# Downloading the stopwords dataset
nltk.download('stopwords')
df
# Rename the default columns v1/v2 to more descriptive names
new_column_names = {"v1": "Category", "v2": "Message"}
df.rename(columns=new_column_names, inplace=True)

# Show duplicate messages, then drop them (the row counts printed later reflect the deduplicated data)
df[df.duplicated()]
df = df.drop_duplicates()
Data Visualisation:
sns.countplot(data=df, x='Category')
plt.xlabel('Category')
plt.ylabel('count')
plt.title('Distribution of mails')
plt.show()
Data Preprocessing:
# Convert the "Category" column values to numerical representation (0 for "spam" and 1 for "ham")
df.loc[df["Category"] == "spam", "Category"] = 0
df.loc[df["Category"] == "ham", "Category"] = 1

X = df["Message"]
Y = df["Category"]
X
Y
# Split the data into training and test sets (80/20 split; random_state here is an assumption)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

print(X.shape)
print(X_train.shape)
print(X_test.shape)
(5169,)
(4135,)
(1034,)
# TF-IDF feature extraction (the vectorizer parameters follow common practice and are assumptions)
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# Convert the labels to integers
Y_train = Y_train.astype(int)
Y_test = Y_test.astype(int)
print(X_train)
print(X_train_features)
Model Training:
model = LogisticRegression()
model.fit(X_train_features, Y_train)
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print("Accuracy on training data:", accuracy_on_training_data)

# Make predictions on the test data and calculate the accuracy
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)
print("Accuracy on test data:", accuracy_on_test_data)
# Classify a new message (the example mail below is illustrative, not taken from the dataset)
input_mail = ["Congratulations! You have won a free prize. Call now to claim it."]
input_features = feature_extraction.transform(input_mail)
prediction = model.predict(input_features)

if prediction[0] == 1:
    print("Ham Mail")
else:
    print("Spam Mail")
# Confusion matrix on the test set
cm = confusion_matrix(Y_test, prediction_on_test_data)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
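The imports at the top of the program bring in roc_curve and roc_auc_score but never use them; the snippet below is one way they could be applied to the fitted logistic regression, reusing model, X_test_features and Y_test from above and treating ham (label 1) as the positive class.

# ROC curve and AUC on the test set
probs = model.predict_proba(X_test_features)[:, 1]
fpr, tpr, _ = roc_curve(Y_test, probs)
auc = roc_auc_score(Y_test, probs)

plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()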
# Data visualization - Top 10 Most Common Words in Spam Emails
stop_words = set(stopwords.words('english'))
spam_words = " ".join(df[df['Category'] == 0]['Message']).split()
ham_words = " ".join(df[df['Category'] == 1]['Message']).split()

# Count word frequencies, ignoring stop words
spam_word_freq = Counter(word.lower() for word in spam_words if word.lower() not in stop_words)
ham_word_freq = Counter(word.lower() for word in ham_words if word.lower() not in stop_words)

plt.bar(*zip(*spam_word_freq.most_common(10)), color='g')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Top 10 Most Common Words in Spam Emails')
plt.xticks(rotation=45)
plt.show()

# Data visualization - Top 10 Most Common Words in Ham Emails
plt.bar(*zip(*ham_word_freq.most_common(10)), color='b')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Top 10 Most Common Words in Ham Emails')
plt.xticks(rotation=45)
plt.show()
Output:
Conclusion:
The number of people using mobile devices increasing day by day. SMS (short message
service) is a text message service available in smartphones as well as basic phones. So, the
traffic of SMS increased drastically. The spam messages also increased. The spammers try to
send spam messages for their financial or business benefits like market growth, lottery ticket
information, credit card information, etc. So, spam classification has special attention. In this
paper, we applied various machine learning and deep learning techniques for SMS spam
detection. we used a dataset from UCI and build a spam detection model. Our experimental
results have shown that our LSTM model outperforms previous models in spam detection
with an accuracy of 98.5%. We used python for all implementations.