daima jieshi

The document outlines a process for building a spam detection model using n-grams from SMS data. It includes data preprocessing, dataset splitting, n-gram generation, and a prediction function that utilizes log probabilities for classification. Additionally, it discusses the importance of separating ham and spam messages for effective model training and the impact of smoothing parameters on model performance.

Uploaded by

mlin41088
Copyright
© All Rights Reserved

library(data.table)
library(caTools)
library(caret)

# Read data
sms_data <- fread("E:/2024/AI/ML/project2/SMS/SMSSpamCollection", header
= FALSE, sep = "\t", quote = "")
colnames(sms_data) <- c("label", "text")

sms_data[, text := tolower(text)]
This converts the text content to lowercase.

sms_data[, text := paste0("{", text, "}")]
Adding markers at the beginning and end of the text allows the model to
better identify where each message begins and ends.
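
As a quick check of the two preprocessing steps, the following sketch applies them to a single made-up message. The sample string is an assumption for illustration and is not taken from the dataset.

```r
# Hypothetical sample message; not from the SMS dataset.
sample_text <- "Free WIN now"

lowered <- tolower(sample_text)        # "free win now"
marked  <- paste0("{", lowered, "}")   # "{free win now}"

print(marked)
```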

# Split the dataset: 80% for training, 20% for testing.
set.seed(202)
split <- sample.split(sms_data$label, SplitRatio = 0.8)
train_data <- sms_data[split, ]
test_data <- sms_data[!split, ]

# Separate ham and spam


train_ham <- train_data[label == "ham", ]
train_spam <- train_data[label == "spam", ]

# Generate n-grams. First set n = 3, then define a function.

unlist flattens the list of trigrams generated for each text into a single
vector.
The outer sapply processes each txt one by one.
For each txt, the inner function(txt) {...} generates the trigrams of that
text.
Inside it, seq(nchar(txt) - n + 1) generates the sequence of starting
positions from which an n-gram can be extracted.
nchar(txt) is the number of characters in txt.
nchar(txt) - n + 1 is the last starting position that still allows a complete
trigram to be extracted.
In function(i), i is the starting position of each trigram. The function
extracts the substring of txt from position i to i + n - 1, which has length
3 and is therefore a trigram.
Together, these steps iterate over every txt and generate all of its trigrams.
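
The indexing described above can be checked on a single string. This is a minimal sketch: char_ngrams is a hypothetical helper name, and it uses seq_len() rather than seq(), because seq(0) returns 1 0 and would produce invalid start positions for a text shorter than n.

```r
# Hypothetical helper for one string; seq_len() is used instead of seq()
# so that a text shorter than n yields no n-grams rather than bad indices.
char_ngrams <- function(txt, n) {
  starts <- seq_len(max(0, nchar(txt) - n + 1))
  vapply(starts, function(i) substr(txt, i, i + n - 1), character(1))
}

trigrams <- char_ngrams("{hello}", 3)
print(trigrams)
# "{he" "hel" "ell" "llo" "lo}"
```

The string has 7 characters, so the last valid start is 7 - 3 + 1 = 5 and five trigrams come out.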

ngram_counts is a table containing the count of every trigram. For example,
in a small illustrative table 'he' appears twice and every other n-gram
appears only once.
total is the sum of all n-gram counts.
probabilities = ngram_counts / total is a table containing the probability of
each trigram. The function returns a list holding probabilities and total.
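n <- 3
generate_ngram_prob <- function(data, n) {
  # Extract every character n-gram from every text and flatten into one vector
  ngrams <- unlist(sapply(data$text, function(txt) {
    sapply(seq(nchar(txt) - n + 1), function(i) substr(txt, i, i + n - 1))
  }))
  ngram_counts <- table(ngrams)           # count of each distinct n-gram
  total <- sum(ngram_counts)              # total number of n-grams
  probabilities <- ngram_counts / total   # relative frequency of each n-gram
  list(probabilities = probabilities, total = total)
}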
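
To make the counts and probabilities concrete, here is a small sketch on made-up input. The two texts and the use of bigrams (n = 2) are assumptions chosen so that 'he' appears twice and every other n-gram once; the variable names are illustrative only.

```r
# Toy input (an assumption for illustration): two short texts, bigrams (n = 2).
texts <- c("hello", "he")
n_toy <- 2

# Collect every bigram from every text into one character vector.
bigrams <- unlist(lapply(texts, function(txt) {
  starts <- seq_len(max(0, nchar(txt) - n_toy + 1))
  vapply(starts, function(i) substr(txt, i, i + n_toy - 1), character(1))
}))

counts <- table(bigrams)   # 'he' = 2, 'el' = 1, 'll' = 1, 'lo' = 1
total  <- sum(counts)      # 5
probs  <- counts / total   # 'he' = 0.4, the others 0.2

print(probs)
```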

Call the function above


# Generate separate n-gram models for ham and spam
ngram_ham <- generate_ngram_prob(train_ham, n)
ngram_spam <- generate_ngram_prob(train_spam, n)

# Prediction function
vocab_size represents the total number of possible n-grams.
It assumes that each character can take 40 different values (letters,
digits, and some special symbols), so there are 40^n possible combinations
in total.
vocab_size <- 40^n
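
The value 40 is an assumption about the alphabet. One possible way to sanity-check it is to count the distinct characters that actually occur in the texts. The sketch below does this on made-up strings; texts, distinct_chars, and observed_vocab_size are hypothetical names, and the result on the real data may differ from 40.

```r
n <- 3
vocab_size <- 40^n   # 64000 possible trigrams under the 40-character assumption

# Hypothetical check: count the distinct characters that actually occur.
texts <- c("{hi}", "{ok 1}")                      # made-up sample texts
distinct_chars <- unique(unlist(strsplit(texts, "")))
observed_vocab_size <- length(distinct_chars)^n   # 8 distinct characters -> 512

print(c(assumed = vocab_size, observed = observed_vocab_size))
```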

Next is the prediction function, which takes four parameters: text,
ngram_ham, ngram_spam, and vocab_size. It also reads n from the global
environment.

ngrams is generated in the same way as above. unlist is not used here
because prediction handles a single text and does not need a count table.
Then the two log probabilities are initialized to 0.
Next is a loop over ngrams, where ng is the current trigram.
names(ngram_ham$probabilities) holds the n-grams seen in the ham training
data (in the small example above, names such as 'he', 'el', 'll', 'lo'). If
ng is among these names, its log probability is added to log_prob_ham.
Otherwise the n-gram was not seen in training, and
log(0.5 / (ngram_ham$total + 0.5 * vocab_size)) is added instead. This is
add-k smoothing with k = 0.5, which prevents the probability from being 0.
Note that the code applies it only to unseen n-grams; seen n-grams keep
their unsmoothed relative frequency.
The same is done for spam: if ng is among the names of
ngram_spam$probabilities, its log probability is added to log_prob_spam;
otherwise the smoothed value is added.
When the loop finishes, the two log probabilities are compared: if the ham
value is greater, the text is classified as ham, otherwise as spam.
predict_sms <- function(text, ngram_ham, ngram_spam, vocab_size) {
  ngrams <- sapply(seq(nchar(text) - n + 1), function(i) substr(text, i, i + n - 1))

  log_prob_ham <- 0
  log_prob_spam <- 0
  for (ng in ngrams) {
    if (ng %in% names(ngram_ham$probabilities)) {
      log_prob_ham <- log_prob_ham + log(ngram_ham$probabilities[ng])
    } else {
      log_prob_ham <- log_prob_ham + log(0.5 / (ngram_ham$total + 0.5 * vocab_size))
    }

    if (ng %in% names(ngram_spam$probabilities)) {
      log_prob_spam <- log_prob_spam + log(ngram_spam$probabilities[ng])
    } else {
      log_prob_spam <- log_prob_spam + log(0.5 / (ngram_spam$total + 0.5 * vocab_size))
    }
  }

  if (log_prob_ham > log_prob_spam) {
    return("ham")
  } else {
    return("spam")
  }
}
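
The scoring logic can be tried on tiny hand-made models. This is a sketch under stated assumptions: score_text is a hypothetical helper that follows the same rule as the loop in predict_sms (stored probability for seen n-grams, the k / (total + k * vocab_size) fallback for unseen ones), and toy_ham and toy_spam are invented models, not ones trained on the SMS data.

```r
# Hypothetical helper: log-score of one text under one n-gram model.
# Seen n-grams use their stored probability; unseen n-grams use the
# k / (total + k * vocab_size) fallback.
score_text <- function(text, model, n, vocab_size, k = 0.5) {
  starts <- seq_len(max(0, nchar(text) - n + 1))
  ngrams <- vapply(starts, function(i) substr(text, i, i + n - 1), character(1))
  seen <- ngrams %in% names(model$probabilities)
  unseen_log_prob <- log(k / (model$total + k * vocab_size))
  sum(log(as.numeric(model$probabilities[ngrams[seen]]))) +
    sum(!seen) * unseen_log_prob
}

# Made-up toy models (not trained on the SMS data).
toy_ham  <- list(probabilities = c("{hi" = 0.5, "hi}" = 0.5), total = 2)
toy_spam <- list(probabilities = c("{wi" = 0.5, "in}" = 0.5), total = 2)

ham_score  <- score_text("{hi}", toy_ham,  n = 3, vocab_size = 40^3)
spam_score <- score_text("{hi}", toy_spam, n = 3, vocab_size = 40^3)
label <- if (ham_score > spam_score) "ham" else "spam"
print(label)   # "ham"
```

Both trigrams of "{hi}" are known to the toy ham model and unknown to the toy spam model, so the spam score is pulled far down by the fallback and the text is labelled ham.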

# Apply the prediction function to the test data

sapply calls predict_sms for each text in the text column of test_data.
predict_sms predicts whether each message is ham or spam and returns the
corresponding class label.
Here, := is data.table syntax for creating a new column or updating an
existing one.
predicted := creates a new column named predicted and stores the results
returned by sapply in it.
test_data[, predicted := sapply(text, predict_sms, ngram_ham, ngram_spam,
vocab_size)]
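
The column assignment can be seen in isolation on a two-row table. In this sketch toy_predict is a hypothetical stand-in classifier, not the n-gram model, and the rows are made up. vapply with USE.NAMES = FALSE is used here in place of sapply only to guarantee a plain character vector.

```r
library(data.table)

# Stand-in classifier (hypothetical): NOT the n-gram model, only a placeholder
# so the column assignment can be shown on its own.
toy_predict <- function(txt) if (grepl("free", txt, fixed = TRUE)) "spam" else "ham"

toy <- data.table(text = c("{free prize}", "{see you soon}"))

# := adds the column by reference; no reassignment of `toy` is needed.
toy[, predicted := vapply(text, toy_predict, character(1), USE.NAMES = FALSE)]
print(toy)
```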

Then I create the confusion matrix and compute some metrics.

Two confusion matrices are created because the per-class metrics that
confusionMatrix reports describe only the class given as positive.
With positive = "ham" it reports precision, recall, and F1 for ham;
with positive = "spam" it reports them for spam.
# Confusion matrix and metrics
conf_matrix_ham <- confusionMatrix(as.factor(test_data$predicted),
as.factor(test_data$label), positive = "ham")
conf_matrix_spam <- confusionMatrix(as.factor(test_data$predicted),
as.factor(test_data$label), positive = "spam")
print(conf_matrix_ham$table)

# metrics
accuracy <- conf_matrix_ham$overall["Accuracy"]
precision_ham <- conf_matrix_ham$byClass["Precision"]
precision_spam <- conf_matrix_spam$byClass["Precision"]
recall_ham <- conf_matrix_ham$byClass["Recall"]
recall_spam <- conf_matrix_spam$byClass["Recall"]
f1_ham <- conf_matrix_ham$byClass["F1"]
f1_spam <- conf_matrix_spam$byClass["F1"]

# Calculate average metrics


macroaverage_precision <- mean(c(precision_ham, precision_spam))
macroaverage_recall <- mean(c(recall_ham, recall_spam))
macroaverage_f1 <- mean(c(f1_ham, f1_spam))

# Print metrics
# Create a data frame to store the metrics
metrics_df <- data.frame(
  Class = c("Ham", "Spam", "Macroaverage"),
  Precision = c(round(precision_ham, 4), round(precision_spam, 4),
                round(macroaverage_precision, 4)),
  Recall = c(round(recall_ham, 4), round(recall_spam, 4),
             round(macroaverage_recall, 4)),
  F1_Score = c(round(f1_ham, 4), round(f1_spam, 4),
               round(macroaverage_f1, 4))
)

# Print the metrics table


print(metrics_df)
cat("Accuracy:", accuracy, "\n")
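
To show what the per-class and macro-averaged metrics measure, the sketch below computes them by hand in base R. The label vectors are made up and class_metrics is a hypothetical helper; it illustrates the definitions and is not a replacement for confusionMatrix.

```r
# Made-up labels for illustration only.
actual    <- c("ham", "ham", "ham", "ham", "spam", "spam")
predicted <- c("ham", "ham", "ham", "spam", "spam", "ham")

# Hypothetical helper: precision, recall and F1 for one chosen positive class.
class_metrics <- function(predicted, actual, positive) {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(precision = precision, recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}

ham_metrics   <- class_metrics(predicted, actual, "ham")    # 0.75, 0.75, 0.75
spam_metrics  <- class_metrics(predicted, actual, "spam")   # 0.50, 0.50, 0.50
macro_metrics <- (ham_metrics + spam_metrics) / 2           # 0.625 each
accuracy      <- mean(predicted == actual)                  # 4 / 6

print(rbind(ham = ham_metrics, spam = spam_metrics, macro = macro_metrics))
```

The macro average weights both classes equally, so a weak spam score lowers it even when ham, the larger class, is predicted well.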

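Why separate the ham and spam training datasets?


Spam messages tend to contain more promotional words, such as "free",
"discount", and "win", while normal messages are more likely to contain
words used in daily communication. The classifier compares how likely a
message is under the ham model and under the spam model, so each model must
be trained only on its own class. If ham and spam were not separated, the
n-gram probabilities would mix both classes and the model would struggle to
capture these distinguishing features, which would hurt detection.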

Why is performance worse with a small k, such as k = 0.3?


With a small k, an n-gram that was not seen in training receives a very low
probability. When new n-gram combinations appear in the test data, they
sharply reduce the overall probability of the message under that class, so
the model may tend to predict the other class. The model then relies too
heavily on the n-grams it saw in training and performs poorly on the test
set, which is a form of overfitting.
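
The effect of k on unseen n-grams can be checked numerically. In this sketch the total of 100000 n-grams is a hypothetical figure, not one measured on the real data; only the direction of the change matters.

```r
vocab_size <- 40^3
total <- 100000   # hypothetical n-gram count; not taken from the real data

# Log probability assigned to one unseen n-gram for several values of k.
ks <- c(0.3, 0.5, 1)
unseen_log_prob <- log(ks / (total + ks * vocab_size))
names(unseen_log_prob) <- ks

print(round(unseen_log_prob, 3))
```

The values become more negative as k shrinks, so each unseen n-gram costs the message more under a smaller k.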
