Speech and Language Processing (3rd ed. draft) Dan Jurafsky and James H. Martin Chap 4: Naive Bayes, Text Classification, and Sentiment


Text Classification and Naïve Bayes

The Task of Text Classification
Dan Jurafsky

Is this spam?

Who wrote which Federalist papers?

• 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
• Authorship of 12 of the essays in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods

[Portraits: James Madison, Alexander Hamilton]

Positive or negative movie review?

• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.

What is the subject of this article?

Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language Identification
• Sentiment analysis
• …

Text Classification: definition

• Input:
• a document d
• a fixed set of classes C = {c1, c2,…, cJ}

• Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
• spam: black-list-address OR ("dollars" AND "have been selected")
• Accuracy can be high
• If rules are carefully refined by an expert
• But building and maintaining these rules is expensive
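The spam rule on this slide can be written directly as a boolean function. This is only a sketch: the `BLACKLIST` contents and the function name `is_spam` are hypothetical, since the slide names no actual addresses.

```python
# Hypothetical blacklist; the slide does not list any real addresses.
BLACKLIST = {"promo@example.com"}

def is_spam(sender: str, body: str) -> bool:
    """spam: black-list-address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    return sender.lower() in BLACKLIST or (
        "dollars" in text and "have been selected" in text
    )
```

Every new spam pattern needs another hand-written clause like this one, which is where the maintenance cost comes from.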

Classification Methods: Supervised Machine Learning
• Input:
• a document d
• a fixed set of classes C = {c1, c2,…, cJ}
• A training set of m hand-labeled documents (d1,c1),....,(dm,cm)
• Output:
• a learned classifier γ: d → c
Classification Methods:
Supervised Machine Learning
• Any kind of classifier
• Naïve Bayes
• Logistic regression
• Support-vector machines
• k-Nearest Neighbors

• …
Text Classification and Naive Bayes
The Naive Bayes Classifier

Naive Bayes Intuition

Simple ("naive") classification method based on Bayes rule
Relies on very simple representation of document
◦ Bag of words
The Bag of Words Representation

[Figure: a movie review is reduced to an unordered bag of words with counts.]

Review: "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!"

Word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …

The bag of words representation

γ( seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ... ) = c
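A bag of words is just a table of word counts with word order thrown away. The sketch below builds one with `collections.Counter`. The tokenizer (lowercasing, keeping runs of letters and apostrophes) and the shortened review text are assumptions made for illustration, not something the slides specify.

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Crude tokenizer (an assumption): lowercase, keep runs of letters and apostrophes.
    return Counter(re.findall(r"[a-z']+", text.lower()))

bow = bag_of_words("I love this movie! I would recommend it. I've seen it several times.")
```

Looking up a word that never occurred, such as `bow["fantastic"]`, returns 0 rather than raising an error, which is convenient when counting features.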
Bayes' Rule Applied to Documents and Classes

• For a document d and a class c

$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$
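Naive Bayes Classifier (I)

$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(c \mid d) && \text{MAP is ``maximum a posteriori'' = most likely class} \\
&= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} && \text{Bayes rule} \\
&= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) && \text{dropping the denominator}
\end{aligned}
$$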
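Naive Bayes Classifier (II)

$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} \underbrace{P(d \mid c)}_{\text{likelihood}}\;\underbrace{P(c)}_{\text{prior}} \\
&= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c) && \text{document } d \text{ represented as features } x_1..x_n
\end{aligned}
$$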
Naïve Bayes Classifier (IV)

$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$

• P(x1, x2,…, xn | c): O(|X|^n • |C|) parameters. Could only be estimated if a very, very large number of training examples was available.
• P(c): How often does this class occur? We can just count the relative frequencies in a corpus.
Multinomial Naive Bayes Independence Assumptions

$$P(x_1, x_2, \ldots, x_n \mid c)$$

Bag of Words assumption: Assume position doesn't matter

Conditional Independence: Assume the feature probabilities P(xi | c) are independent given the class c.

$$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdots P(x_n \mid c)$$

Multinomial Naive Bayes Classifier

$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$

$$c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$$
Applying Multinomial Naive Bayes Classifiers to Text Classification

positions ← all word positions in test document

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$
Problems with multiplying lots of probs
There's a problem with this:

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$

Multiplying lots of probabilities can result in floating-point underflow!

.0006 * .0007 * .0009 * .01 * .5 * .000008….
Idea: Use logs, because log(ab) = log(a) + log(b)
We'll sum logs of probabilities instead of multiplying probabilities!
We actually do everything in log space
Instead of this:

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$

This:

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} \left[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \right]$$

Notes:
1) Taking log doesn't change the ranking of classes!
The class with highest probability also has highest log probability!
2) It's a linear model:
Just a max of a sum of weights: a linear function of the inputs
So naive bayes is a linear classifier
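The underflow problem is easy to reproduce. In this sketch the 100 word likelihoods of 1e-5 are made-up values chosen so that the raw product falls below the smallest representable double, while the sum of logs stays an ordinary number.

```python
import math

probs = [1e-5] * 100   # made-up per-word likelihoods for a 100-word document

product = 1.0
for p in probs:
    product *= p       # 1e-500 cannot be represented as a double, so this ends at 0.0

log_sum = sum(math.log(p) for p in probs)   # stays finite: 100 * log(1e-5)
```

Once every class scores 0.0 the argmax is meaningless, whereas the log scores can still be compared.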
Text Classification and Naive Bayes
Naive Bayes: Learning
Sec.13.3

Learning the Multinomial Naive Bayes Model

First attempt: maximum likelihood estimates
◦ simply use the frequencies in the data

$$\hat{P}(c_j) = \frac{N_{c_j}}{N_{total}}$$

$$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$$
Parameter estimation

$$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$$

fraction of times word wi appears among all words in documents of topic cj

Create mega-document for topic j by concatenating all docs in this topic
◦ Use frequency of w in mega-document
Sec.13.3

Problem with Maximum Likelihood

What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?

$$\hat{P}(\text{``fantastic''} \mid \text{positive}) = \frac{count(\text{``fantastic''}, \text{positive})}{\sum_{w \in V} count(w, \text{positive})} = 0$$

Zero probabilities cannot be conditioned away, no matter the other evidence!

$$c_{MAP} = \operatorname*{argmax}_{c} \hat{P}(c) \prod_{i} \hat{P}(x_i \mid c)$$
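Laplace (add-1) smoothing for Naïve Bayes

$$
\begin{aligned}
\hat{P}(w_i \mid c) &= \frac{count(w_i, c) + 1}{\sum_{w \in V} \left( count(w, c) + 1 \right)} \\
&= \frac{count(w_i, c) + 1}{\left( \sum_{w \in V} count(w, c) \right) + |V|}
\end{aligned}
$$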
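A small sketch of the add-1 estimate follows. The three-word vocabulary and the counts for the positive class are invented for illustration; the point is that "fantastic", which has a count of zero, no longer gets probability zero.

```python
from collections import Counter

def laplace_likelihood(word, counts, vocab):
    """P(word | c) with add-1 smoothing; `counts` holds the word counts for class c."""
    total = sum(counts[w] for w in vocab)
    return (counts[word] + 1) / (total + len(vocab))

vocab = {"fantastic", "boring", "fun"}          # toy vocabulary (assumption)
positive = Counter({"fun": 3, "boring": 1})     # toy counts; "fantastic" is unseen
p_fantastic = laplace_likelihood("fantastic", positive, vocab)   # (0 + 1) / (4 + 3)
```

The smoothed estimates still sum to 1 over the vocabulary, because |V| is added to the denominator.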
Multinomial Naïve Bayes: Learning

• From training corpus, extract Vocabulary
• Calculate P(cj) terms
◦ For each cj in C do
docsj ← all docs with class = cj
P(cj) ← |docsj| / |total # documents|
• Calculate P(wk | cj) terms
◦ Textj ← single doc containing all docsj
◦ For each word wk in Vocabulary
nk ← # of occurrences of wk in Textj
P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
where n is the total number of word tokens in Textj
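The learning procedure above can be sketched in a few lines of Python. The function name `train_nb`, the input format (a list of token-list and label pairs), and the choice to return log probabilities are all assumptions made for this sketch, not part of the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs: list of (tokens, label) pairs. Returns log priors, log likelihoods, vocabulary."""
    vocab = {w for tokens, _ in docs for w in tokens}
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)           # plays the role of Text_j, one per class
    for tokens, label in docs:
        word_counts[label].update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        n = sum(word_counts[c].values())         # total tokens in Text_j
        log_lik[c] = {
            w: math.log((word_counts[c][w] + alpha) / (n + alpha * len(vocab)))
            for w in vocab
        }
    return log_prior, log_lik, vocab

toy_docs = [(["good", "fun"], "+"), (["bad"], "-")]   # invented two-document corpus
log_prior, log_lik, vocab = train_nb(toy_docs)
```

Storing logs at training time means classification only needs additions, which matches the log-space formulation used for scoring.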
Unknown words
What about unknown words
◦ that appear in our test data
◦ but not in our training data or vocabulary?
We ignore them
◦ Remove them from the test document!
◦ Pretend they weren't there!
◦ Don't include any probability for them at all!
Why don't we build an unknown word model?
◦ It doesn't help: knowing which class has more unknown words is
not generally helpful!
Stop words
Some systems ignore stop words
◦ Stop words: very frequent words like the and a.
◦ Sort the vocabulary by word frequency in training set
◦ Call the top 10 or 50 words the stopword list.
◦ Remove all stop words from both training and test sets
◦ As if they were never there!

But removing stop words doesn't usually help

• So in practice most NB algorithms use all words and don't
use stopword lists
Text Classification and Naive Bayes
Sentiment and Binary Naive Bayes
Let's do a worked sentiment example!

A worked sentiment example with add-1 smoothing

          Cat   Documents
Training  -     just plain boring
          -     entirely predictable and lacks energy
          -     no surprises and very few laughs
          +     very powerful
          +     the most fun film of the summer
Test      ?     predictable with no fun

1. Prior from training:

$$\hat{P}(c) = \frac{N_c}{N_{total}} \qquad P(-) = \frac{3}{5} \qquad P(+) = \frac{2}{5}$$

2. Drop "with"

The word "with" doesn't occur in the training set, so we drop it completely (we don't use unknown word models for naive Bayes).

3. Likelihoods from training:

$$p(w_i \mid c) = \frac{count(w_i, c) + 1}{\left( \sum_{w \in V} count(w, c) \right) + |V|}$$

The negative documents contain 14 word tokens, the positive documents contain 9, and |V| = 20.

$$P(\text{``predictable''} \mid -) = \frac{1+1}{14+20} \qquad P(\text{``predictable''} \mid +) = \frac{0+1}{9+20}$$

$$P(\text{``no''} \mid -) = \frac{1+1}{14+20} \qquad P(\text{``no''} \mid +) = \frac{0+1}{9+20}$$

$$P(\text{``fun''} \mid -) = \frac{0+1}{14+20} \qquad P(\text{``fun''} \mid +) = \frac{1+1}{9+20}$$

4. Scoring the test set:

For the test sentence S = "predictable with no fun", after removing the word "with":

$$P(-)\,P(S \mid -) = \frac{3}{5} \times \frac{2 \times 2 \times 1}{34^3} \approx 6.1 \times 10^{-5}$$

$$P(+)\,P(S \mid +) = \frac{2}{5} \times \frac{1 \times 1 \times 2}{29^3} \approx 3.3 \times 10^{-5}$$

The negative score is higher, so the model predicts the class negative.
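The worked example can be checked mechanically. The sketch below recomputes the priors, the add-1 likelihoods, and the two scores from the five training documents. It multiplies raw probabilities rather than summing logs, which is safe here only because there are just three words; whitespace tokenization is an assumption.

```python
from collections import Counter

train = [("-", "just plain boring"),
         ("-", "entirely predictable and lacks energy"),
         ("-", "no surprises and very few laughs"),
         ("+", "very powerful"),
         ("+", "the most fun film of the summer")]
test_doc = "predictable with no fun"

vocab = {w for _, doc in train for w in doc.split()}
counts = {"-": Counter(), "+": Counter()}
n_docs = Counter()
for label, doc in train:
    counts[label].update(doc.split())
    n_docs[label] += 1

scores = {}
for c in counts:
    score = n_docs[c] / len(train)                    # prior P(c)
    total = sum(counts[c].values())                   # 14 for "-", 9 for "+"
    for w in test_doc.split():
        if w in vocab:                                # unknown word "with" is dropped
            score *= (counts[c][w] + 1) / (total + len(vocab))
    scores[c] = score

prediction = max(scores, key=scores.get)
```

Here `scores["-"]` is larger than `scores["+"]`, so `prediction` is the negative class.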
Optimizing for sentiment analysis
For tasks like sentiment, word occurrence seems to be more important than word frequency.
◦ The occurrence of the word fantastic tells us a lot
◦ The fact that it occurs 5 times may not tell us much more.
Binary multinomial naive bayes, or binary NB
◦ Clip our word counts at 1
◦ Note: this is different from Bernoulli naive bayes; see the textbook at the end of the chapter.
Binary Multinomial Naïve Bayes: Learning
• From training corpus, extract Vocabulary
• Calculate P(cj) terms
◦ For each cj in C do
docsj ← all docs with class = cj
P(cj) ← |docsj| / |total # documents|
• Calculate P(wk | cj) terms
◦ Remove duplicates in each doc:
• For each word type w in docj
• Retain only a single instance of w
◦ Textj ← single doc containing all docsj
◦ For each word wk in Vocabulary
nk ← # of occurrences of wk in Textj
P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
Binary Multinomial Naive Bayes on a test document d
First remove all duplicate words from d
Then compute NB using the same equation:

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(w_i \mid c_j)$$
Binary multinomial naive Bayes

Four original documents:
− it was pathetic the worst part was the boxing scenes
− no plot twists or great scenes
+ and satire and great plot twists
+ great scenes great film

After per-document binarization:
− it was pathetic the worst part boxing scenes
− no plot twists or great scenes
+ and satire great plot twists
+ great scenes film

The counts are shown without add-1 smoothing to make the differences clearer.

            NB Counts    Binary Counts
word        +     −      +     −
and         2     0      1     0
boxing      0     1      0     1
film        1     0      1     0
great       3     1      2     1
it          0     1      0     1
no          0     1      0     1
or          0     1      0     1
part        0     1      0     1
pathetic    0     1      0     1
plot        1     1      1     1
satire      1     0      1     0
scenes      1     2      1     2
the         0     2      0     1
twists      1     1      1     1
was         0     2      0     1
worst       0     1      0     1

Note that the resulting counts need not be 1; the word great has a count of 2 even for Binary NB, because it appears in multiple documents.
Counts can still be 2! Binarization is within-doc!
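The two count columns can be reproduced with a short sketch: ordinary counts add every token, while binary counts add each word type once per document by passing through a `set`. Whitespace tokenization and the variable names are assumptions.

```python
from collections import Counter

docs = [("-", "it was pathetic the worst part was the boxing scenes"),
        ("-", "no plot twists or great scenes"),
        ("+", "and satire and great plot twists"),
        ("+", "great scenes great film")]

nb_counts = {"+": Counter(), "-": Counter()}
binary_counts = {"+": Counter(), "-": Counter()}
for label, text in docs:
    tokens = text.split()
    nb_counts[label].update(tokens)
    binary_counts[label].update(set(tokens))   # each word type counted once per document
```

Because the clipping happens inside each document, `binary_counts["+"]["great"]` is 2: the word appears in both positive documents.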
Text Classification and Naive Bayes
More on Sentiment Classification
Sentiment Classification: Dealing with Negation
I really like this movie
I really don't like this movie

Negation changes the meaning of "like" to negative.

Negation can also change negative to positive-ish
◦ Don't dismiss this film
◦ Doesn't let us get bored
Sentiment Classification: Dealing with Negation
Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79–86.

Simple baseline method:

Add NOT_ to every word between negation and following punctuation:

didn't like this movie , but I

didn't NOT_like NOT_this NOT_movie but I
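A minimal sketch of this baseline is shown below. The list of negation cues and the punctuation set are assumptions (the slide does not enumerate them), and unlike the slide's output this version keeps the punctuation token in the result.

```python
import re

# Assumed negation cues: "not", "no", "never", and any word ending in n't.
NEGATION = re.compile(r"^(not|no|never|\w+n't)$", re.IGNORECASE)
PUNCT = re.compile(r"^[.,:;!?]$")

def mark_negation(tokens):
    out, negating = [], False
    for tok in tokens:
        if PUNCT.match(tok):
            negating = False            # punctuation ends the negated span
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if NEGATION.match(tok):
                negating = True         # start prefixing from the next token
    return out

marked = mark_negation("didn't like this movie , but I".split())
```

After this step `NOT_like` is a separate vocabulary item from `like`, so the classifier can learn a different weight for it.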

Sentiment Classification: Lexicons
Sometimes we don't have enough labeled training data
In that case, we can make use of pre-built word lists
Called lexicons
There are various publicly available lexicons
MPQA Subjectivity Cues Lexicon
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.

Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003.

Home page: https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/

6885 words from 8221 lemmas, annotated for intensity (strong/weak)
◦ 2718 positive
◦ 4912 negative
+ : admirable, beautiful, confident, dazzling, ecstatic, favor, glee, great
− : awful, bad, bias, catastrophe, cheat, deny, envious, foul, harsh, hate
The General Inquirer
Philip J. Stone, Dexter C Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press

◦ Home page: http://www.wjh.harvard.edu/~inquirer
◦ List of Categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm
◦ Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
Categories:
◦ Positiv (1915 words) and Negativ (2291 words)
◦ Strong vs Weak, Active vs Passive, Overstated versus Understated
◦ Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc
Free for Research Use
Using Lexicons in Sentiment Classification
Add a feature that gets a count whenever a word
from the lexicon occurs
◦ E.g., a feature called "this word occurs in the positive
lexicon" or "this word occurs in the negative lexicon"
Now all positive words (good, great, beautiful,
wonderful) or negative words count for that feature.
Using 1-2 features isn't as good as using all the words.
• But when training data is sparse or not representative of the
test set, dense lexicon features can help
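The two lexicon features could be computed as in the sketch below. The word sets are tiny samples taken from the MPQA examples listed earlier, and the feature names are hypothetical; a real system would load the full lexicon files.

```python
# Tiny illustrative samples, not the full lexicons.
POSITIVE = {"admirable", "beautiful", "great"}
NEGATIVE = {"awful", "bad", "hate"}

def lexicon_features(tokens):
    """Count how many tokens fall in each lexicon (two dense features)."""
    pos = sum(tok.lower() in POSITIVE for tok in tokens)
    neg = sum(tok.lower() in NEGATIVE for tok in tokens)
    return {"in_positive_lexicon": pos, "in_negative_lexicon": neg}

features = lexicon_features("a great but awful bad film".split())
```

Every positive word contributes to the same feature, so the feature receives evidence even when an individual word was never seen in training.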
Naive Bayes in Other tasks: Spam Filtering
SpamAssassin Features:
◦ Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
◦ From: starts with many numbers
◦ Subject is all capitals
◦ HTML has a low ratio of text to image area
◦ "One hundred percent guaranteed"
◦ Claims you can be removed from the list
Naive Bayes in Language ID
Determining what language a piece of text is written in.
Features based on character n-grams do very well
Important to train on lots of varieties of each language
(e.g., American English varieties like African-American English,
or English varieties around the world like Indian English)
Summary: Naive Bayes is Not So Naive
Very fast, low storage requirements
Works well with very small amounts of training data
Robust to irrelevant features
◦ Irrelevant features cancel each other without affecting results
Very good in domains with many equally important features
◦ Decision trees suffer from fragmentation in such cases, especially if there is little data
Optimal if the independence assumptions hold
◦ If the assumed independence is correct, then it is the Bayes optimal classifier for the problem
A good dependable baseline for text classification
◦ But we will see other classifiers that give better accuracy
Slide from Chris Manning
Text Classification and Naïve Bayes
Naïve Bayes: Relationship to Language Modeling

Generative Model for Multinomial Naïve Bayes

[Figure: the class node c=China generates the words X1=Shanghai, X2=and, X3=Shenzhen, X4=issue, X5=bonds]

Naïve Bayes and Language Modeling

• Naïve bayes classifiers can use any sort of feature
• URL, email address, dictionaries, network features
• But if, as in the previous slides
• We use only word features
• we use all of the words in the text (not a subset)
• Then
• Naïve bayes has an important similarity to language modeling.
Sec.13.2.1

Each class = a unigram language model

• Assigning each word: P(word | c)
• Assigning each sentence: P(s|c)=Π P(word|c)

Class pos
P(word | pos)   word
0.1             I
0.1             love
0.01            this
0.05            fun
0.1             film

s:              I      love    this    fun     film
P(word | pos):  0.1    0.1     0.01    0.05    0.1

P(s | pos) = 0.0000005
Sec.13.2.1

Naïve Bayes as a Language Model

• Which class assigns the higher probability to s?

word    Model pos   Model neg
I       0.1         0.2
love    0.1         0.001
this    0.01        0.01
fun     0.05        0.005
film    0.1         0.1

s:      I      love    this    fun     film
pos:    0.1    0.1     0.01    0.05    0.1
neg:    0.2    0.001   0.01    0.005   0.1

P(s|pos) > P(s|neg)
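The comparison between the two class models can be reproduced with the word probabilities from the slide. The helper name `sentence_prob` is an assumption; `math.prod` requires Python 3.8 or later.

```python
import math

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

def sentence_prob(model, sentence):
    """P(s | c) under a unigram model: the product of the word probabilities."""
    return math.prod(model[w] for w in sentence.split())

s = "I love this fun film"
p_pos = sentence_prob(pos, s)
p_neg = sentence_prob(neg, s)
```

With these numbers `p_pos` is about 5e-7 and `p_neg` about 1e-9, so the positive model assigns the higher probability to the sentence.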
Text Classification and Naive Bayes
Precision, Recall, and F1
Evaluating Classifiers: How well does our
classifier work?
Let's first address binary classifiers:
• Is this email spam?
spam (+) or not spam (-)
• Is this post about Delicious Pie Company?
about Del. Pie Co (+) or not about Del. Pie Co(-)

We'll need to know

1. What did our classifier say about each email or post?
2. What should our classifier have said, i.e., the correct
answer, usually as defined by humans ("gold label")
First step in evaluation: The confusion matrix

                                gold standard labels
                                gold positive     gold negative
system output   system positive true positive     false positive
labels          system negative false negative    true negative

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + tn + fn)

Accuracy on the confusion matrix

accuracy = (tp + tn) / (tp + fp + tn + fn)
Why don't we use accuracy?
Accuracy doesn't work well when we're dealing with
uncommon or imbalanced classes
Suppose we look at 1,000,000 social media posts to find
Delicious Pie-lovers (or haters)
• 100 of them talk about our pie
• 999,900 are posts about something unrelated
Imagine the following simple classifier
Every post is "not about pie"
Accuracy re: pie posts
100 posts are about pie; 999,900 aren't
Why don't we use accuracy?
Accuracy of our "nothing is pie" classifier
999,900 true negatives and 100 false negatives
Accuracy is 999,900/1,000,000 = 99.99%!
But useless at finding pie-lovers (or haters)!!
Which was our goal!
Accuracy doesn't work well for unbalanced classes
Most tweets are not about pie!
Instead of accuracy we use precision and recall

Precision: % of selected items that are correct

Recall: % of correct items that are selected
Precision/Recall aren't fooled by the "just call everything negative" classifier!
Stupid classifier: Just say no: every tweet is "not about pie"
• 100 tweets talk about pie, 999,900 tweets don't
• Accuracy = 999,900/1,000,000 = 99.99%
But the Recall and Precision for this classifier are terrible:

$$\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$

$$\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$

Recall is 0, since there are no true positives and 100 false negatives.
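The three metrics can be computed from the four confusion-matrix cells. In the sketch below the function name `evaluate` is hypothetical, and returning `None` when a denominator is zero is a design choice of this sketch: the "nothing is pie" classifier labels nothing positive, so its precision is 0/0.

```python
def evaluate(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else None   # undefined if nothing is labeled positive
    recall = tp / (tp + fn) if tp + fn else None
    return accuracy, precision, recall

# "Nothing is pie": 100 pie posts missed, 999,900 unrelated posts correctly rejected.
accuracy, precision, recall = evaluate(tp=0, fp=0, fn=100, tn=999_900)
```

Accuracy comes out at 0.9999 while recall is 0.0, which is the mismatch the slide is pointing at.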
A combined measure: F1
F1 is a combination of precision and recall.

$$F_1 = \frac{2PR}{P+R}$$

F1 is a special case of the general "F-measure"
F-measure is the (weighted) harmonic mean of precision and recall
The harmonic mean of a set of numbers is the reciprocal of the arithmetic mean of reciprocals:

$$\text{HarmonicMean}(a_1, a_2, a_3, \ldots, a_n) = \frac{n}{\frac{1}{a_1} + \frac{1}{a_2} + \frac{1}{a_3} + \cdots + \frac{1}{a_n}}$$

and hence F-measure is

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} \quad \text{or, with } \beta^2 = \frac{1 - \alpha}{\alpha}, \quad F = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$$

Values of β > 1 favor recall, while values of β < 1 favor precision. When β = 1, precision and recall are equally balanced; this is the most frequently used metric, and is called F_{β=1} or just F1.

F1 is a special case of F-measure with β=1, α=½
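The general F-measure is a one-line formula. In this sketch the name `f_beta` and the convention of returning 0.0 when both precision and recall are zero are assumptions; the precision and recall values in the usage lines are made up.

```python
def f_beta(precision, recall, beta=1.0):
    """F = (beta^2 + 1) P R / (beta^2 P + R); beta = 1 gives F1."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (b2 + 1) * precision * recall / denom if denom else 0.0

f1 = f_beta(0.25, 1.0)            # harmonic mean of 0.25 and 1.0
f2 = f_beta(0.25, 1.0, beta=2)    # beta > 1 weights recall more heavily
```

Because the harmonic mean is pulled toward the smaller value, `f1` is 0.4 rather than the arithmetic mean of 0.625.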

Suppose we have more than 2 classes?
Lots of text classification tasks have more than two classes.
◦ Sentiment analysis (positive, negative, neutral) , named entities (person, location, organization)

We can define precision and recall for multiple classes like this 3-way email task:

                        gold labels
                        urgent   normal   spam
system output  urgent     8       10        1     precision_u = 8 / (8+10+1)
               normal     5       60       50     precision_n = 60 / (5+60+50)
               spam       3       30      200     precision_s = 200 / (3+30+200)

recall_u = 8 / (8+5+3)
recall_n = 60 / (10+60+30)
recall_s = 200 / (1+50+200)
How to combine P/R values for different classes: Microaveraging vs Macroaveraging

Class 1: Urgent
                 true urgent   true not
system urgent         8           11
system not            8          340
precision = 8 / (8+11) = .42

Class 2: Normal
                 true normal   true not
system normal        60           55
system not           40          212
precision = 60 / (60+55) = .52

Class 3: Spam
                 true spam     true not
system spam         200           33
system not           51           83
precision = 200 / (200+33) = .86

Pooled
                 true yes      true no
system yes          268           99
system no            99          635
microaverage precision = 268 / (268+99) = .73

macroaverage precision = (.42 + .52 + .86) / 3 = .60
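Both averages can be derived from the 3-way confusion matrix. The sketch below assumes rows are system outputs and columns are gold labels, in the order urgent, normal, spam, matching the email example.

```python
# Rows: system output; columns: gold label (urgent, normal, spam).
confusion = [[8, 10, 1],
             [5, 60, 50],
             [3, 30, 200]]

tp = [confusion[i][i] for i in range(3)]
fp = [sum(confusion[i]) - tp[i] for i in range(3)]   # the rest of row i

# Macroaverage: compute precision per class, then average the three values.
macro_precision = sum(t / (t + f) for t, f in zip(tp, fp)) / 3
# Microaverage: pool the counts over all classes, then compute one precision.
micro_precision = sum(tp) / (sum(tp) + sum(fp))
```

The microaverage is dominated by the frequent spam class, while the macroaverage gives the small urgent class the same weight as the others, which is why the two numbers differ.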
Text Classification and Naive Bayes
Avoiding Harms in Classification
Harms of classification
Classifiers, like any NLP algorithm, can cause harms
This is true for any classifier, whether Naive Bayes or
other algorithms
Representational Harms
• Harms caused by a system that demeans a social group
• Such as by perpetuating negative stereotypes about them.
• Kiritchenko and Mohammad 2018 study
• Examined 200 sentiment analysis systems on pairs of sentences
• Identical except for names:
• common African American (Shaniqua) or European American (Stephanie).
• Like "I talked to Shaniqua yesterday" vs "I talked to Stephanie yesterday"
• Result: systems assigned lower sentiment and more negative
emotion to sentences with African American names
• Downstream harm:
• Perpetuates stereotypes about African Americans
• African Americans treated differently by NLP tools like sentiment (widely
used in marketing research, mental health studies, etc.)
Harms of Censorship
• Toxicity detection is the text classification task of detecting hate speech,
abuse, harassment, or other kinds of toxic language.
• Widely used in online content moderation
• Toxicity classifiers incorrectly flag non-toxic sentences that simply
mention minority identities (like the words "blind" or "gay")
• women (Park et al., 2018),
• disabled people (Hutchinson et al., 2020)
• gay people (Dixon et al., 2018; Oliva et al., 2021)
• Downstream harms:
• Censorship of speech by disabled people and other groups
• Speech by these groups becomes less visible online
• Writers might be nudged by these algorithms to avoid these words
making people less likely to write about themselves or these groups.
Performance Disparities
1. Text classifiers perform worse on many languages of
the world due to lack of data or labels
2. Text classifiers perform worse on varieties of even
high-resource languages like English
• Example task: language identification, a first step in NLP
pipeline ("Is this post in English or not?")
• English language detection performance worse for writers
who are African American (Blodgett and O'Connor 2017)
or from India (Jurgens et al., 2017)
Harms in text classification
• Causes:
• Issues in the data; NLP systems amplify biases in training data
• Problems in the labels
• Problems in the algorithms (like what the model is trained to
optimize)
• Prevalence: The same problems occur throughout NLP
(including large language models)
• Solutions: There are no general mitigations or solutions
• But harm mitigation is an active area of research
• And there are standard benchmarks and tools that we can use
for measuring some of the harms
Text Avoiding Harms in Classification
Classification
and Naive
Bayes

