AI Phase4

The document describes building a machine learning model to classify emails as spam or ham (non-spam) using a support vector machine algorithm. It discusses collecting a labeled dataset, preprocessing the data, extracting features, training the model, evaluating performance, and testing the model on custom emails. The model achieved over 90% accuracy on the test data. It also visualizes the top words in spam vs ham emails and a confusion matrix to analyze model performance.

Uploaded by

techusama4

Building a Smarter AI-Powered Spam Classifier

TEAM MEMBER: J.VIJAYAN

Phase-4 Notebook Submission

Introduction:
Spam email is a growing problem for web users. These unsolicited messages waste network
resources, and machine learning techniques are commonly used to filter them. This article
examines the capabilities of the extreme learning machine (ELM) and the support vector
machine (SVM) for classifying spam emails by class label (d). The ELM is an efficient model
based on a single-hidden-layer feedforward neural network whose hidden-layer weights are
chosen randomly. The support vector machine is a classifier grounded in statistical learning
theory and is used frequently for classification. The performance of ELM is compared with
that of SVM. The comparative study examines accuracy, precision, recall, false positives, and
true positives. In addition, a sensitivity analysis is performed for ELM and SVM on spam
email classification.
Process:
Email spam detection is a common problem in the field of machine learning. In this context,
we can use machine learning algorithms to classify emails into spam and non-spam
categories. One such algorithm is the Support Vector Machine (SVM) algorithm.

Training the Model:

1. Data Collection: Collect a dataset of emails that are labeled as spam or non-spam.
2. Data Preprocessing: Preprocess the collected data by removing stop words,
stemming, and tokenizing the text.
3. Feature Extraction: Extract features from the preprocessed data using techniques like
Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).
4. Model Training: Train a machine learning model on the extracted features using
algorithms like SVM, Naive Bayes, or Random Forest.
5. Model Evaluation: Evaluate the performance of the trained model using metrics like
accuracy, precision, recall, and F1-score.

Dataset Link: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

Given Data Set:

Data Collection:
Gather a large and diverse dataset of emails or messages, labeled as spam or non-spam
(ham). This dataset is crucial for training your AI model.

Data Preprocessing
Clean and preprocess the data. This may involve tasks like removing special characters,
stemming, and tokenization.
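
The preprocessing tasks named above can be sketched in plain Python. This is a minimal illustration, not the notebook's own code: the stop-word list and the suffix-stripping "stemmer" are simplified assumptions, and the helper names naive_stem and preprocess are hypothetical. A real pipeline would more likely use a full stop-word list and an established stemmer such as Porter's.

```python
import re

# Assumption: a tiny illustrative stop-word list, far shorter than a real one.
STOP_WORDS = {"a", "an", "the", "to", "is", "are", "you", "your",
              "for", "of", "and", "in", "on"}

# Assumption: crude suffix stripping stands in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop special characters, tokenize, remove stop words, stem."""
    text = text.lower()
    # Replace every character that is not a lowercase letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("Congratulations! You've WON 2 free tickets, claim your prizes now."))
```

The naive stemmer can produce non-words (for example, it cuts "prizes" to "priz"). That is acceptable for feature extraction as long as the same function is applied to both training and test messages.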

Feature Extraction: Extract relevant features from the preprocessed data. Common
techniques include Bag-of-Words, TF-IDF (Term Frequency-Inverse Document Frequency), or
word embeddings like Word2Vec.
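
To make the TF-IDF idea concrete, here is a small sketch that computes textbook TF-IDF weights by hand. It assumes whitespace tokenization and the unsmoothed formula idf(t) = ln(N / df(t)); the function name tf_idf and the sample messages are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using textbook TF-IDF.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = ln(N / df(t)), where df(t) is the number of documents containing t
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical messages, for illustration only.
docs = ["free prize claim now", "meeting at noon", "claim your free prize"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

A term that occurs in every document gets a weight of zero under this formula, which is why very common words contribute little. scikit-learn's TfidfVectorizer, by default, uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from this sketch.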

Choosing a Model: Select an appropriate machine learning algorithm or deep learning
architecture for your task. Popular choices include Naïve Bayes, Support Vector Machines,
or deep learning models like Recurrent Neural Networks (RNNs) or Transformers.

Training the Model: Train your chosen model using the preprocessed data. Use a portion
of the dataset for training and another portion for validation to fine-tune the model.

Evaluation:
Evaluate the model’s performance using metrics like accuracy, precision, recall, and
F1-score. This helps in understanding how well your model is performing.
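
As a sketch of how these four metrics relate, the function below derives them from raw counts. The counts in the example are invented for illustration and are not results from this notebook; "positive" here is assumed to mean the spam class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts.

    tp: spam correctly flagged      fp: ham wrongly flagged as spam
    fn: spam that slipped through   tn: ham correctly left alone
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts the accuracy is 0.97 even though one in five spam messages is missed (recall 0.80). On an imbalanced dataset, accuracy alone can therefore look better than the classifier really is.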

Fine-Tuning:
Based on the evaluation results, fine-tune the model. This could involve adjusting
hyperparameters, trying different algorithms, or exploring advanced techniques like
ensemble methods.

Testing and Deployment:


Test the final model on a separate test dataset to ensure its generalizability. Once you’re
confident in its performance, deploy the model into your application or system.

Monitoring and Maintenance:


Continuously monitor the model’s performance in real-world scenarios. Spam patterns can
change over time, so periodic updates and retraining might be necessary to maintain the
classifier’s accuracy.

Python program:
#importing libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, roc_curve,
                             roc_auc_score)
import nltk
from nltk.corpus import stopwords
from collections import Counter

#libraries for data visualization


import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv", encoding='ISO-8859-1')
df

Output:

df.info()
# Downloading the stopwords dataset

nltk.download('stopwords')

# Drop unnecessary columns from the DataFrame

columns_to_drop = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]


df.drop(columns=columns_to_drop, inplace=True)

df

# Rename the columns "v1" and "v2" to new names

new_column_names = {"v1":"Category","v2":"Message"}
df.rename(columns = new_column_names,inplace = True)

df[df.duplicated()]

#Drop duplicated values


df=df.drop_duplicates()
df
df.info()
df.describe()
df.shape
df['Category'].value_counts()

Data Visualisation:
sns.countplot(data=df, x='Category')
plt.xlabel('Category')
plt.ylabel('count')
plt.title('Distribution of mails')
plt.show()
Data Preprocessing:
# Convert the "Category" column values to a numerical representation
# (0 for "spam" and 1 for "ham")

df.loc[df["Category"] == "spam", "Category"] = 0
df.loc[df["Category"] == "ham", "Category"] = 1
df.head()

# Separate the feature (X) and target (Y) data

X = df["Message"]
Y = df["Category"]

X
Y

# Split the data into training and testing sets

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

print(X.shape)
print(X_train.shape)
print(X_test.shape)

Output:
(5169,)
(4135,)
(1034,)
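
The split sizes can be checked by hand. The sketch below assumes that, for a fractional test_size, the test portion is rounded up and the training set receives the remainder, which is how scikit-learn's train_test_split behaves to my understanding. The helpers split_sizes and simple_split are hypothetical, and simple_split does not reproduce scikit-learn's exact shuffling order.

```python
import math
import random


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


def simple_split(items, test_size, seed):
    """Shuffle a copy of items with a fixed seed and cut off the test portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    _, n_test = split_sizes(len(shuffled), test_size)
    return shuffled[n_test:], shuffled[:n_test]


# 5169 deduplicated messages with test_size = 0.2
print(split_sizes(5169, 0.2))  # (4135, 1034)

# The same seed always yields the same partition, which keeps runs reproducible.
train, test = simple_split(range(10), test_size=0.2, seed=42)
print(len(train), len(test))
```

Fixing the seed (random_state in the notebook) matters because accuracy figures are only comparable between runs when the train and test partitions are identical.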

Feature Extraction: TF-IDF:


# Create a TF-IDF vectorizer to convert text messages into numerical features

feature_extraction = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

Y_train = Y_train.astype(int)
Y_test = Y_test.astype(int)

print(X_train)

print(X_train_features)

Model Training:

# Create a logistic regression model and train it on the training data

model = LogisticRegression()
model.fit(X_train_features, Y_train)

Model Evaluation and Prediction:

# Make predictions on the training data and calculate the accuracy

prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print("Accuracy on training data:", accuracy_on_training_data)

# Make predictions on the test data and calculate the accuracy

prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test,prediction_on_test_data)

print("Accuracy on test data:",accuracy_on_test_data)

# Test the model with some custom email messages

input_mail = [
    "Congratulations! You've won a free vacation to an exotic island. "
    "Just click on the link below to claim your prize."
]
input_data_features = feature_extraction.transform(input_mail)
prediction = model.predict(input_data_features)

# Label 1 means ham and label 0 means spam
if prediction[0] == 1:
    print("Ham Mail")
else:
    print("Spam Mail")

Output: Spam Mail

input_mail = [
    "This is a friendly reminder about our meeting scheduled for tomorrow at "
    "10:00 AM in the conference room. Please make sure to prepare your "
    "presentation and bring any necessary materials."
]
input_data_features = feature_extraction.transform(input_mail)
prediction = model.predict(input_data_features)

# Label 1 means ham and label 0 means spam
if prediction[0] == 1:
    print("Ham Mail")
else:
    print("Spam Mail")

Output: Ham Mail

# Data visualization - Confusion Matrix


cm = confusion_matrix(Y_test, prediction_on_test_data)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
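
To help read the heatmap, the sketch below builds the same kind of matrix by hand: rows are true labels, columns are predicted labels, and labels are ordered 0 (spam) then 1 (ham), which matches the convention of scikit-learn's confusion_matrix. The function name confusion_counts and the label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Build a matrix where rows are true labels and columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical labels, for illustration only (0 = spam, 1 = ham).
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

cm = confusion_counts(y_true, y_pred)
spam_caught, spam_missed = cm[0]   # top row: messages that are truly spam
ham_flagged, ham_kept = cm[1]      # bottom row: messages that are truly ham

print(cm)
print("spam caught:", spam_caught, "| spam missed:", spam_missed)
print("ham wrongly flagged:", ham_flagged, "| ham kept:", ham_kept)
```

With this encoding, the top-right cell counts spam that reached the inbox and the bottom-left cell counts legitimate messages that were blocked. The second kind of error is usually the more costly one for users.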
# Data visualization - Top 10 Most Common Words in Spam Emails

stop_words = set(stopwords.words('english'))
spam_words = " ".join(df[df['Category'] == 0]['Message']).split()
ham_words = " ".join(df[df['Category'] == 1]['Message']).split()

spam_word_freq = Counter(
    word.lower() for word in spam_words
    if word.lower() not in stop_words and word.isalpha()
)
plt.figure(figsize=(10, 6))

plt.bar(*zip(*spam_word_freq.most_common(10)), color='g')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Top 10 Most Common Words in Spam Emails')
plt.xticks(rotation=45)
plt.show()
# Data visualization - Top 10 Most Common Words in Ham Emails

ham_word_freq = Counter(
    word.lower() for word in ham_words
    if word.lower() not in stop_words and word.isalpha()
)
plt.figure(figsize=(10, 6))
plt.bar(*zip(*ham_word_freq.most_common(10)), color='maroon')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Top 10 Most Common Words in Ham Emails')
plt.xticks(rotation=45)
plt.show()

Output:
Conclusion:
The number of people using mobile devices is increasing day by day. SMS (short message
service) is a text messaging service available on smartphones as well as basic phones, so SMS
traffic has increased drastically, and spam messages have increased with it. Spammers send
spam messages for financial or business benefit, such as market growth, lottery ticket
information, or credit card information, so spam classification deserves special attention. In
this paper, we applied various machine learning and deep learning techniques to SMS spam
detection. We used a dataset from UCI and built a spam detection model. Our experimental
results have shown that our LSTM model outperforms previous models in spam detection
with an accuracy of 98.5%. We used Python for all implementations.
