
Lec-6 Spam-1

The document discusses spam detection using machine learning techniques. Specifically, it describes using a naïve Bayes classifier to classify emails as either spam or ham (not spam) based on the words contained. The naïve Bayes approach works by calculating the probability that an email belongs to each class based on the occurrence of individual words. It is claimed to have a high accuracy of 97% and advantages such as being self-learning, considering the full message, and being language independent. An example of how the probabilities are calculated is provided.

Uploaded by

Adish garg

SPAM Detection

SPAM
• Originating from the name of Hormel's canned meat, "spam" now
also refers to junk e-mail or irrelevant postings to a newsgroup
or bulletin board.
• The unsolicited e-mail messages you receive about refinancing
your home, reversing aging, and losing those extra pounds are
all considered to be spam.
• Spamming other people is definitely not cool and is one of the
most notorious violations of Internet etiquette (or
"netiquette").
• So if you ever get the urge to let thousands of people know
about that hot new guaranteed way to make money on the
Internet, please reconsider.
One Solution to Spam Detection
• Machine Learning
– Learn spam versus good/ham

• Naïve Bayes

Advantages of Bayesian Method
• The Bayesian approach is self-adapting: it keeps learning from new
spam.
• The Bayesian method takes the whole message into account.
• The Bayesian method is easy to use and very accurate (claimed
accuracy: 97%).
• The Bayesian approach is multilingual.
• It reduces the number of false positives.

A Spam Filter
• Naïve Bayes spam filter
• Data:
– Collection of emails, labeled spam or ham
– Note: someone has to hand-label all this data!
– Split into training, testing sets
• Classifiers
– Learn on the training set
– Test it on new emails

Example emails:

Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …

TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99

Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
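The split into training and testing sets mentioned on this slide could be sketched as follows. This is only an illustration: the function name `train_test_split`, the 25% test fraction and the fixed seed are assumptions, not something the slides specify.

```python
import random

def train_test_split(labeled_emails, test_fraction=0.25, seed=0):
    """Shuffle (text, label) pairs and hold out a fraction for testing.

    test_fraction and seed are illustrative defaults, not values from the slides.
    """
    shuffled = list(labeled_emails)          # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # reproducible shuffle
    n_test = int(len(shuffled) * test_fraction)
    # First n_test items become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]
```

The classifier is then fitted on the first returned list only and evaluated on the second, so that the test emails are genuinely new to it.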
• Posterior: later in time
• Prior: coming before or earlier


Discrete example
Separate spam from valid email, attributes=words
• D1: “send us your password” spam
• D2: “send us your review” ham
• D3: “review password” ham
• D4: “review us” spam
• D5: “send your password” spam
• D6: “send us your account” spam
Construct Vocabulary

Word       P(word | spam)   P(word | ham)
password   2/4              1/2
review     1/4              2/2
send       3/4              1/2
us         3/4              1/2
your       3/4              1/2
account    1/4              0/2

Separate spam from valid email, attributes=words
• D1: “send us your password” spam
• D2: “send us your review” ham
• D3: “review password” ham
• D4: “review us” spam
• D5: “send your password” spam
• D6: “send us your account” spam

P(spam) = 4/6
P(ham) = 2/6
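The priors and the per-word table can be re-derived from the six labelled documents. The sketch below assumes that each document is reduced to the set of words it contains and that probabilities are raw counts with no smoothing (which is what the 0/2 entry for "account" suggests). The names `DOCS` and `train` are illustrative, and D3 is taken as “review password”, the reading that matches the ham value of 1/2 for "your".

```python
from fractions import Fraction

# Training documents from the slide (D3 taken as "review password").
DOCS = [
    ("send us your password", "spam"),
    ("send us your review", "ham"),
    ("review password", "ham"),
    ("review us", "spam"),
    ("send your password", "spam"),
    ("send us your account", "spam"),
]

def train(docs):
    """Return (priors, cond), where cond[label][word] = P(word present | label)."""
    labels = sorted({label for _, label in docs})
    vocab = sorted({word for text, _ in docs for word in text.split()})
    priors, cond = {}, {}
    for label in labels:
        # Word sets of the documents that carry this label.
        texts = [set(text.split()) for text, lab in docs if lab == label]
        priors[label] = Fraction(len(texts), len(docs))
        cond[label] = {
            word: Fraction(sum(word in t for t in texts), len(texts))
            for word in vocab
        }
    return priors, cond

priors, cond = train(DOCS)
for word in sorted(cond["spam"]):
    print(f"{word:10s} spam={cond['spam'][word]}  ham={cond['ham'][word]}")
```

`Fraction` prints values in lowest terms, so 2/4 appears as 1/2 and 4/6 as 2/3.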
Naïve Bayes
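• Want P(spam | words)
• Use Bayes' rule:

$$P(\text{spam} \mid \text{words}) = \frac{P(\text{words} \mid \text{spam})\, P(\text{spam})}{P(\text{words})}$$

$$P(\text{words}) = P(\text{words} \mid \text{spam})\, P(\text{spam}) + P(\text{words} \mid \text{ham})\, P(\text{ham})$$

• Assume independence: the probability of each word is independent of the others

$$P(\text{words} \mid \text{spam}) = P(\text{word}_1 \mid \text{spam}) \times P(\text{word}_2 \mid \text{spam}) \times \dots \times P(\text{word}_n \mid \text{spam})$$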
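The three formulas on this slide can be combined into one small function. This is a minimal sketch, assuming the per-word probabilities for the message are already known. The function name `posterior_spam` and its parameters are hypothetical, and the numbers in the usage lines are made up purely for illustration.

```python
import math

def posterior_spam(word_probs_spam, word_probs_ham, p_spam):
    """P(spam | words) under the word-independence assumption.

    word_probs_spam[i] is P(word_i | spam), word_probs_ham[i] is P(word_i | ham).
    Assumes the evidence P(words) is non-zero.
    """
    like_spam = math.prod(word_probs_spam)   # P(words | spam)
    like_ham = math.prod(word_probs_ham)     # P(words | ham)
    p_ham = 1.0 - p_spam
    evidence = like_spam * p_spam + like_ham * p_ham   # P(words)
    return like_spam * p_spam / evidence

# Made-up probabilities for a two-word message, for illustration only.
example = posterior_spam([0.9, 0.8], [0.1, 0.2], p_spam=0.5)
print(round(example, 3))
```

For long messages the product of many small probabilities can underflow, so practical filters usually sum logarithms instead. The plain product is kept here because it mirrors the slide.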

Construct Vocabulary

Word       P(word | spam)   P(word | ham)
password   2/4              1/2
review     1/4              2/2
send       3/4              1/2
us         3/4              1/2
your       3/4              1/2
account    1/4              0/2

P(spam) = 4/6
P(ham) = 2/6

New email: “review us now”

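Over the vocabulary (password, review, send, us, your, account), the new email has the presence vector (0,1,0,1,0,0); “now” is not in the vocabulary and is ignored.

$$P(\text{review us} \mid \text{spam}) = P(0,1,0,1,0,0 \mid \text{spam}) = \left(1-\tfrac{2}{4}\right)\left(\tfrac{1}{4}\right)\left(1-\tfrac{3}{4}\right)\left(\tfrac{3}{4}\right)\left(1-\tfrac{3}{4}\right)\left(1-\tfrac{1}{4}\right) \approx 0.0044$$

$$P(\text{review us} \mid \text{ham}) = P(0,1,0,1,0,0 \mid \text{ham}) = \left(1-\tfrac{1}{2}\right)\left(\tfrac{2}{2}\right)\left(1-\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(1-\tfrac{1}{2}\right)\left(1-\tfrac{0}{2}\right) = 0.0625$$

$$P(\text{ham} \mid \text{words}) = \frac{P(\text{words} \mid \text{ham})\, P(\text{ham})}{P(\text{words})}$$

$$P(\text{ham} \mid \text{review us}) = \frac{0.0625 \times 2/6}{0.0625 \times 2/6 + 0.0044 \times 4/6} \approx 0.88$$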

Is it correct?
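The arithmetic of this worked example can be checked with exact fractions. The sketch below hard-codes the table values and priors from the slides. The names `VOCAB`, `P_WORD`, `PRIOR`, `likelihood` and `posterior` are illustrative, and the model assumed is the one the slide uses: a word contributes p when present and 1 - p when absent.

```python
from fractions import Fraction as F

VOCAB = ["password", "review", "send", "us", "your", "account"]
# P(word present | class), in VOCAB order, copied from the slide's table.
P_WORD = {
    "spam": [F(2, 4), F(1, 4), F(3, 4), F(3, 4), F(3, 4), F(1, 4)],
    "ham":  [F(1, 2), F(2, 2), F(1, 2), F(1, 2), F(1, 2), F(0, 2)],
}
PRIOR = {"spam": F(4, 6), "ham": F(2, 6)}

def likelihood(text, label):
    """P(presence vector | label): p if the word is present, 1 - p if absent."""
    present = set(text.split())   # words outside VOCAB (e.g. "now") are ignored
    result = F(1)
    for word, p in zip(VOCAB, P_WORD[label]):
        result *= p if word in present else 1 - p
    return result

def posterior(text, label):
    """P(label | text) by Bayes' rule over the two classes."""
    joint = {c: likelihood(text, c) * PRIOR[c] for c in PRIOR}
    return joint[label] / sum(joint.values())

email = "review us now"
print(likelihood(email, "spam"))        # 9/2048, about 0.0044
print(likelihood(email, "ham"))         # 1/16 = 0.0625
print(posterior(email, "ham"))          # 64/73
print(float(posterior(email, "ham")))   # about 0.877
```

Whether the resulting label is the right one is a separate question from the arithmetic: in the training data, D4 (“review us”) is labelled spam.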
