daima jieshi (code explanation)

The document outlines a process for building a spam detection model using n-grams from SMS data. It includes data preprocessing, dataset splitting, n-gram generation, and a prediction function that utilizes log probabilities for classification. Additionally, it discusses the importance of separating ham and spam messages for effective model training and the impact of smoothing parameters on model performance.

Uploaded by

mlin41088

library(data.table)
library(caTools)
library(caret)

# Read data
sms_data <- fread("E:/2024/AI/ML/project2/SMS/SMSSpamCollection", header = FALSE, sep = "\t", quote = "")
colnames(sms_data) <- c("label", "text")

sms_data[, text := tolower(text)]
Converts the message text to lowercase.

sms_data[, text := paste0("{", text, "}")]
Adds a "{" marker at the start and a "}" marker at the end of each message, which allows the model to identify where each message begins and ends.
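
To make the effect of these two preprocessing steps concrete, here is a small base R sketch. The two messages are made-up examples, not rows from the SMS dataset, and the variable names are only for illustration.

```r
# Toy messages (hypothetical, not taken from the SMS dataset)
msgs <- c("WIN a FREE prize", "See you at 5")

# Lowercase first, then wrap each message in start/end markers
clean <- paste0("{", tolower(msgs), "}")
print(clean)

# With the markers in place, the first trigram of every message starts with "{",
# so a character trigram model can learn which characters tend to open a message
first_trigrams <- substr(clean, 1, 3)
print(first_trigrams)
```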

# Split the dataset: 80% training data, 20% test data
set.seed(202)
split <- sample.split(sms_data$label, SplitRatio = 0.8)
train_data <- sms_data[split, ]
test_data <- sms_data[!split, ]
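
Passing the label column to sample.split keeps the ham/spam ratio roughly the same in both parts. The base R sketch below illustrates that idea on made-up labels. It is not the caTools implementation, only a minimal stratified split written under that assumption.

```r
set.seed(202)

# Hypothetical labels: 80 ham and 20 spam
labels <- c(rep("ham", 80), rep("spam", 20))

# For each class separately, mark about 80% of its rows as training rows
in_train <- logical(length(labels))
for (cls in unique(labels)) {
  idx <- which(labels == cls)
  n_train <- round(0.8 * length(idx))
  in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
}

# The class proportions match in the training and test parts
print(prop.table(table(labels[in_train])))
print(prop.table(table(labels[!in_train])))
```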

# Separate ham and spam

train_ham <- train_data[label == "ham", ]
train_spam <- train_data[label == "spam", ]

# Next we generate n-grams. First we set n = 3, then define a function that builds the trigram model.

The outer sapply processes each text in turn. For each txt, the inner function(txt) {...} generates the trigrams of that text, and unlist flattens the per-text trigram lists into a single vector.
Inside the inner function, a sequence of start positions is generated, running from 1 to nchar(txt) - n + 1.
nchar(txt) is the number of characters in txt.
nchar(txt) - n + 1 is the last start position from which a complete trigram can still be extracted.
function(i) receives each start position i and extracts the substring of txt from position i to i + n - 1. That substring has length 3, so it is one trigram.
Together, these steps iterate over all texts and generate every trigram.

ngram_counts is a table containing the count of every distinct trigram. As an illustration, in a small example 'he' appears twice and the other entries appear once each.
total is the sum of all n-gram counts.
probabilities = ngram_counts / total is a table containing the relative frequency of each trigram. The function returns probabilities and total together in a list.
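n <- 3
generate_ngram_prob <- function(data, n) {
  # Collect every character n-gram from every message into one vector
  ngrams <- unlist(sapply(data$text, function(txt) {
    # seq_len(max(0, ...)) gives no start positions for texts shorter than n
    starts <- seq_len(max(0, nchar(txt) - n + 1))
    sapply(starts, function(i) substr(txt, i, i + n - 1))
  }), use.names = FALSE)
  ngram_counts <- table(ngrams)            # count of each distinct n-gram
  total <- sum(ngram_counts)               # total number of n-grams
  probabilities <- ngram_counts / total    # relative frequency of each n-gram
  list(probabilities = probabilities, total = total)
}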
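
The counting steps can be traced by hand on a single short string. The sketch below uses a made-up string and n = 2 so the output stays small. The helper name char_ngrams is hypothetical and is not part of the script above.

```r
# Extract all character n-grams of one string (hypothetical helper)
char_ngrams <- function(txt, n) {
  starts <- seq_len(max(0, nchar(txt) - n + 1))
  vapply(starts, function(i) substr(txt, i, i + n - 1), character(1))
}

# Toy input: a message already wrapped in the "{" and "}" markers
grams <- char_ngrams("{hehe}", 2)
print(grams)

# Count each distinct n-gram, then turn counts into relative frequencies
counts <- table(grams)
probs <- counts / sum(counts)
print(counts)
print(round(probs, 2))
```

Here "he" occurs twice and every other bigram once, so "he" gets probability 2/5 and each of the others 1/5.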

Call the function above

# Generate separate n-gram models for ham and spam
ngram_ham <- generate_ngram_prob(train_ham, n)
ngram_spam <- generate_ngram_prob(train_spam, n)

# Prediction function
vocab_size is the total number of possible n-grams.
The code assumes that each character can take one of 40 values (letters, digits, and some special symbols), which gives 40^n possible n-grams.
vocab_size <- 40^n
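
The figure of 40 characters is an assumption, and one way to sanity-check it is to count the distinct characters that actually occur in the text. The sketch below does this on two made-up messages. On the real training data the count may well differ from 40.

```r
# Toy messages (hypothetical); in practice this would be the training text column
msgs <- c("{win a free prize!}", "{see you at 5}")

# Split every message into single characters and keep the distinct ones
chars <- unique(unlist(strsplit(msgs, "")))
alphabet_size <- length(chars)
print(alphabet_size)

# Number of possible trigrams under the assumed and the observed alphabet
n <- 3
print(40^n)             # 64000, the value used in the script
print(alphabet_size^n)
```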

Next is the prediction function, which takes four parameters: text, ngram_ham, ngram_spam, and vocab_size.

ngrams is generated in the same way as above. Here I do not use unlist or table, because prediction only needs the sequence of n-grams in the message, not a count table.
Then both log probabilities are initialized to 0.
Next is a loop over ngrams, the n-grams just extracted from the message; ng is the current trigram.
names(ngram_ham$probabilities) holds the n-grams seen in the ham training data, with entries such as 'he', 'el', 'll', 'lo'. If ng is among them, the log of its probability is added to log_prob_ham. Otherwise ng was never seen in ham, and the code adds log(0.5 / (ngram_ham$total + 0.5 * vocab_size)) instead. This is an add-k style fallback with k = 0.5, which prevents an unseen n-gram from getting probability 0. Note that seen n-grams keep their unsmoothed relative frequency.
The spam score is updated in the same way: if ng is in the spam list, the log of its probability is added to log_prob_spam, otherwise the add-k fallback is added.
When the loop finishes, the two log probabilities are compared: if the ham score is greater, the text is classified as ham, otherwise as spam.
predict_sms <- function(text, ngram_ham, ngram_spam, vocab_size) {
  # n is taken from the global environment (n <- 3 above)
  # seq_len(max(0, ...)) gives no start positions for texts shorter than n
  starts <- seq_len(max(0, nchar(text) - n + 1))
  ngrams <- sapply(starts, function(i) substr(text, i, i + n - 1))

  log_prob_ham <- 0
  log_prob_spam <- 0
  for (ng in ngrams) {
    if (ng %in% names(ngram_ham$probabilities)) {
      log_prob_ham <- log_prob_ham + log(ngram_ham$probabilities[ng])
    } else {
      # Unseen in ham: fall back to k / (total + k * vocab_size) with k = 0.5
      log_prob_ham <- log_prob_ham + log(0.5 / (ngram_ham$total + 0.5 * vocab_size))
    }

    if (ng %in% names(ngram_spam$probabilities)) {
      log_prob_spam <- log_prob_spam + log(ngram_spam$probabilities[ng])
    } else {
      # Unseen in spam: same fallback, using the spam total
      log_prob_spam <- log_prob_spam + log(0.5 / (ngram_spam$total + 0.5 * vocab_size))
    }
  }

  # The class with the larger total log probability wins
  if (log_prob_ham > log_prob_spam) {
    return("ham")
  } else {
    return("spam")
  }
}
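
The function adds log probabilities rather than multiplying raw probabilities. A short base R sketch shows why, using 400 made-up n-gram probabilities of 0.001 each (the numbers are purely illustrative).

```r
# 400 hypothetical n-gram probabilities, each equal to 0.001
p <- rep(1e-3, 400)

# The direct product is far below the smallest positive double and comes out as 0
print(prod(p))

# The sum of logs stays finite, so two classes can still be compared
print(sum(log(p)))
```

If both class scores were computed as products, both would come out as 0 for a long message and the comparison would be meaningless.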

# Apply prediction function to the test data

sapply calls predict_sms for each text in the text column of test_data.
predict_sms predicts whether each message is ham or spam and returns the corresponding class label.
Here, := is data.table syntax for creating a new column or updating an existing one.
predicted := creates a new column named predicted and stores the results returned by sapply in that column.
test_data[, predicted := sapply(text, predict_sms, ngram_ham, ngram_spam, vocab_size)]

Then I create confusion matrices and compute some metrics.

Here I create two confusion matrices, because in the two-class case the by-class metrics of one confusionMatrix result describe only the class named in positive.
With positive = "ham", the precision, recall, and F1 describe the ham class.
With positive = "spam", they describe the spam class.
# Confusion matrix and metrics
conf_matrix_ham <- confusionMatrix(as.factor(test_data$predicted), as.factor(test_data$label), positive = "ham")
conf_matrix_spam <- confusionMatrix(as.factor(test_data$predicted), as.factor(test_data$label), positive = "spam")
print(conf_matrix_ham$table)

# metrics
accuracy <- conf_matrix_ham$overall["Accuracy"]
precision_ham <- conf_matrix_ham$byClass["Precision"]
precision_spam <- conf_matrix_spam$byClass["Precision"]
recall_ham <- conf_matrix_ham$byClass["Recall"]
recall_spam <- conf_matrix_spam$byClass["Recall"]
f1_ham <- conf_matrix_ham$byClass["F1"]
f1_spam <- conf_matrix_spam$byClass["F1"]

# Calculate average metrics

macroaverage_precision <- mean(c(precision_ham, precision_spam))
macroaverage_recall <- mean(c(recall_ham, recall_spam))
macroaverage_f1 <- mean(c(f1_ham, f1_spam))
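
For reference, the same per-class and macro-averaged metrics can be computed by hand from a 2x2 table. The sketch below uses base R only, on made-up predictions. The helper class_metrics is hypothetical and is not a caret function.

```r
# Hypothetical true labels and predictions (5 ham, 3 spam)
truth <- factor(c("ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham"),
                levels = c("ham", "spam"))
pred  <- factor(c("ham", "ham", "ham", "spam", "spam", "spam", "ham", "ham"),
                levels = c("ham", "spam"))

# Rows are predictions, columns are true labels
cm <- table(Predicted = pred, Actual = truth)

# Precision, recall and F1 with `positive` as the class of interest
class_metrics <- function(cm, positive) {
  tp <- cm[positive, positive]
  precision <- tp / sum(cm[positive, ])   # row: everything predicted as positive
  recall    <- tp / sum(cm[, positive])   # column: everything truly positive
  f1 <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

ham   <- class_metrics(cm, "ham")
spam  <- class_metrics(cm, "spam")
macro <- (ham + spam) / 2                 # macro average: unweighted mean over classes
print(round(rbind(ham, spam, macro), 4))
```

The macro average weights ham and spam equally, regardless of how many messages each class has.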

# Print metrics
# Create a data frame to store the metrics
metrics_df <- data.frame(
  Class = c("Ham", "Spam", "Macroaverage"),
  Precision = c(round(precision_ham, 4), round(precision_spam, 4), round(macroaverage_precision, 4)),
  Recall = c(round(recall_ham, 4), round(recall_spam, 4), round(macroaverage_recall, 4)),
  F1_Score = c(round(f1_ham, 4), round(f1_spam, 4), round(macroaverage_f1, 4))
)

# Print the metrics table

print(metrics_df)
cat("Accuracy:", accuracy, "\n")

Why separate the ham and spam training data?

Spam messages tend to contain more promotional words, such as "free", "discount", and "win", while normal messages are more likely to contain words used in daily communication. Training one n-gram model per class lets each model learn the character patterns typical of its own class. If ham and spam are pooled, the model cannot capture these class-specific features, and detection suffers.

Why does a small k, such as k = 0.3, perform worse?

When the test data contains n-grams that never appeared in training, a small k assigns them a very low probability. A few such n-grams can push the prediction toward the other class or sharply reduce the overall probability of the whole message. The model then relies too heavily on the training counts and performs poorly on the test set, which is a form of overfitting.
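
The size of this effect can be estimated directly. The sketch below computes the log probability given to an unseen trigram for several values of k, assuming a made-up total of 100000 trigrams in one class. It uses the textbook add-k form (count + k) / (total + k * vocab_size) for every n-gram, which differs from the script above, where only unseen n-grams receive the k term. The helper name addk_prob is hypothetical.

```r
# Add-k probability for an n-gram seen `count` times (textbook form)
addk_prob <- function(count, total, k, vocab_size) {
  (count + k) / (total + k * vocab_size)
}

vocab_size <- 40^3
total <- 100000   # hypothetical number of trigrams in one class

# Log probability of an unseen trigram (count = 0) for several k
ks <- c(0.1, 0.3, 0.5, 1)
penalty <- log(addk_prob(0, total, ks, vocab_size))
print(data.frame(k = ks, log_prob_unseen = round(penalty, 3)))

# When every n-gram is smoothed this way, the probabilities sum to 1
toy_counts <- c(3, 1, 0, 0)   # toy vocabulary of 4 n-grams
toy_sum <- sum(addk_prob(toy_counts, sum(toy_counts), 0.5, length(toy_counts)))
print(toy_sum)
```

Smaller k gives a more negative log probability per unseen trigram, so a message with several unseen trigrams is penalized more heavily.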
