
1 Introduction

Bayes’ theorem is of fundamental importance for inferential statistics and many
advanced machine learning models. Bayesian reasoning is a logical approach to updating
the probability of hypotheses in the light of new evidence, and it therefore rightly plays a
pivotal role in science [1]. Bayesian analysis allows us to answer questions for which frequentist
statistical approaches were not developed. In fact, the very idea of assigning
a probability to a hypothesis is not part of the frequentist paradigm. The goal of this
article is to provide a mathematically rigorous yet concise explanation of the foundation
of Bayesian statistics: Bayes’ theorem, which underpins a simple but powerful
machine learning algorithm, the naive Bayes classifier [2]. In contrast to other texts
on these topics, this article is self-contained: it explains all terms and notations in
detail and provides illustrative examples. As a tutorial, this text should therefore be easily
accessible to readers from various backgrounds. As an encyclopedic article, it provides
a complete reference for bioinformaticians, machine learners, and statisticians. Readers
who are already familiar with the statistical background may find the practical examples
in Section 5 most useful. Specifically, Section 5 highlights some caveats and pitfalls (and
how to avoid them) in building a naive Bayes classifier using R, with additional materials
available at the accompanying website http://osf.io/92mes.

2 Fundamentals
2.1 Basic Notation and Concepts
A statistical experiment can be broadly defined as a process that results in one
and only one of several possible outcomes. The collection of all possible outcomes is
called the sample space, denoted by Ω. At the introductory level, we can describe events
by using notation from set theory. For example, our experiment may be one roll of a fair
die. The sample space is then Ω = {1, 2, 3, 4, 5, 6}, which is also referred to as universal
set in the terminology of set theory. A simple event is, for instance, the outcome 2, which
we denote as E1 = {2}. The probability of an event is denoted by P (·). According to the
classic concept of probability, the probability of an event E is the number of outcomes
that are favorable to this event, divided by the total number of possible outcomes for the
experiment:
P(E) = |E| / |Ω|,
where |E| denotes the cardinality of the set E, i.e., the number of elements in E. In our
example, the probability of rolling a 2 is

P(E1) = |E1| / |Ω| = 1/6.

The event “the number is even” is a compound event, denoted by E2 = {2, 4, 6}. The
cardinality of E2 is 3, so the probability of this event is
P(E2) = 3/6.

The complement of E is the event that E does not occur and is denoted by E c , with
P (E c ) = 1 − P (E). In the example, E2c = {1, 3, 5}. Furthermore, P (A|B) denotes the
conditional probability of A given B. Finally, ∅ denotes the empty set, i.e., ∅ = {}.
Let A and B be two events from a sample space Ω, which is either finite with N
elements or countably infinite. Let P : Ω → [0, 1] be a probability distribution on Ω,
such that 0 < P (A) < 1 and 0 < P (B) < 1 and, obviously, P (Ω) = 1. We can represent
these events in a Venn diagram (Fig. 1a). The union of the events A and B, denoted by
A ∪ B, is the event that either A or B or both occur. The intersection of the events A
and B, denoted by A ∩ B, is the event that both A and B occur. Finally, two events, A
and B, are called mutually exclusive if the occurrence of one of these events rules out the
possibility of occurrence of the other event. In the notation of set theory, this means that
A and B are disjoint, i.e., A∩B = ∅. Two events A and B, with P (A) > 0 and P (B) > 0,
are called independent if the occurrence of one event does not affect the probability of
occurrence of the other event, i.e.,

P (A ∩ B) = P (A) · P (B).

Note that the conditional probability, P (A|B), is the joint probability P (A∩B) divided
by the marginal probability P (B). This is a fundamental relation, which has a simple
geometrical interpretation. Loosely speaking, given that we are in the ellipse B (Fig.
1a), what is the probability that we are also in A? To also be in A, we have to be in the
intersection A ∩ B. Hence, the probability is the number of elements in the intersection,
|A ∩ B|, divided by the number of elements in B, i.e., |B|. Formally,

P(A|B) = (|A ∩ B| / |Ω|) / (|B| / |Ω|) = P(A ∩ B) / P(B).

Figure 1: (a) Venn diagram for sets A and B. (b) Illustration of the total probability
theorem. The sample space Ω is divided into five disjoint sets A1 to A5, which partly
overlap with set B.

3 Total Probability Theorem


Before deriving Bayes’ theorem, it is useful to consider the total probability theorem. First,
the addition rule for two events, A and B, is easily derived from Figure 1a:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B) (1)

We assume that the sample space can be divided into n mutually exclusive events
Ai, i = 1, . . . , n, as shown in Figure 1b. Specifically,

1. A1 ∪ A2 ∪ A3 ∪ · · · ∪ An = Ω

2. Ai ∩ Aj = ∅ for i ≠ j

3. Ai ≠ ∅

From Figure 1b, it is obvious that B can be stated as


B = (B ∩ A1 ) ∪ (B ∩ A2 ) ∪ · · · ∪ (B ∩ An )
and we obtain the total probability theorem as

P(B) = Σ_{i=1}^{n} P(B | Ai) P(Ai)    (2)

which can further be rewritten as:

P (B) = P (B | A)P (A) + P (B | Ac )P (Ac ) (3)

because A2 ∪ A3 ∪ · · · ∪ An is the complement of A1 (cf. conditions 1 and 2 above).
Setting A = A1 and Ac = A2 ∪ A3 ∪ · · · ∪ An, we obtain Equation (3).
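To make Eq. (2) concrete, the following minimal Python sketch (an added illustration, not part of the original text) checks the total probability theorem for the die example from Section 2.1, using the partition A1 = {1, 2}, A2 = {3, 4}, A3 = {5, 6} and the event B = "the number is even".

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}              # sample space of one roll of a fair die
partition = [{1, 2}, {3, 4}, {5, 6}]    # mutually exclusive events A1, A2, A3 covering omega
B = {2, 4, 6}                           # compound event "the number is even"

def prob(event):
    """Classic probability: favorable outcomes divided by all possible outcomes."""
    return Fraction(len(event), len(omega))

def cond_prob(event, given):
    """P(event | given) = P(event ∩ given) / P(given)."""
    return prob(event & given) / prob(given)

# Right-hand side of Eq. (2): sum over i of P(B | A_i) P(A_i)
total = sum(cond_prob(B, A) * prob(A) for A in partition)

print(total)      # 1/2
print(prob(B))    # 1/2, i.e., the direct computation agrees with Eq. (2)
```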

4 Bayes’ Theorem
Assuming that |A| ≠ 0 and |B| ≠ 0, we can state the following:

P(A|B) = |A ∩ B| / |B| = (|A ∩ B| / |Ω|) / (|B| / |Ω|) = P(A ∩ B) / P(B)    (4)

P(B|A) = |B ∩ A| / |A| = (|B ∩ A| / |Ω|) / (|A| / |Ω|) = P(A ∩ B) / P(A)    (5)

From Equations (4) and (5), it is immediately obvious that:

P (A ∩ B) = P (A|B)P (B) = P (B|A)P (A) (6)

and therefore:
P(A|B) = P(B|A) P(A) / P(B)    (7)
which is the simplest (and perhaps the most memorable) formulation of Bayes’ theorem.
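As a quick sanity check of Eq. (7), the following short Python snippet (an added illustration) applies Bayes’ theorem to the die example: with A = {2, 4, 6} ("even") and B = {1, 2}, it verifies that P(A|B) computed via Eq. (7) equals the direct computation P(A ∩ B)/P(B).

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}     # "the number is even"
B = {1, 2}        # "the number is at most 2"

def prob(event):
    return Fraction(len(event), len(omega))

def cond_prob(event, given):
    return prob(event & given) / prob(given)

# Bayes' theorem, Eq. (7): P(A|B) = P(B|A) P(A) / P(B)
bayes = cond_prob(B, A) * prob(A) / prob(B)

print(bayes)               # 1/2
print(cond_prob(A, B))     # 1/2, identical to the direct computation
```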

4.1 Generalized Bayes’ Formula


If the sample space Ω can be divided into finitely many mutually exclusive events A1 , A2 , . . . , An ,
and if B is an event with P (B) > 0, which is a subset of the union of all Ai , then for
each Ai , the generalized Bayes’ formula is:

P(Ai|B) = P(B|Ai) P(Ai) / Σ_{j=1}^{n} P(B|Aj) P(Aj)    (8)

This can also be rewritten as:
P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|Ac) P(Ac)]    (9)

Both Equations (8) and (9) follow from Equation (7) because of the total probability
theorem.

4.2 Posterior Probability of Hypotheses


Bayes’ theorem can be used to derive the posterior probability of a hypothesis given
observed data:
P(hypothesis|data) = P(data|hypothesis) P(hypothesis) / P(data)    (10)
where:
• P(data|hypothesis) is the likelihood of the data given the hypothesis,

• P(hypothesis) is the prior probability of the hypothesis,

• P(data) is the probability of observing the data, irrespective of the specified hypothesis.
The prior probability quantifies the a priori plausibility of the hypothesis. Often, the
data can arise under two competing hypotheses, H1 and H2 , with P (H1 ) = 1 − P (H2 ).
Let D denote the observed data. Then:
P(H1|D) = P(D|H1) P(H1) / [P(D|H1) P(H1) + P(D|H2) P(H2)]    (11)

and:

P(H2|D) = P(D|H2) P(H2) / [P(D|H1) P(H1) + P(D|H2) P(H2)]    (12)

From these, the posterior odds are:

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] · [P(H1) / P(H2)]    (13)

Here, P(D|H1)/P(D|H2) is the Bayes factor, which measures the evidence that the data provide
in favor of H1 against H2. The Bayes factor is the ratio of the posterior odds of H1 to its
prior odds, and it can be interpreted as a summary measure of the evidence
that the data provide in favor of the hypothesis H1 against its competing hypothesis H2.
If the prior probability of H1 is the same as that of H2 (i.e., P(H1) = P(H2) = 0.5), then
the Bayes factor is the same as the posterior odds.
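The following Python sketch (added for illustration; the coin-tossing setup is a hypothetical example, not from the text) evaluates Eqs. (11) to (13) for two simple point hypotheses about a coin, H1: P(heads) = 0.5 and H2: P(heads) = 0.7, after observing D = 8 heads in 10 tosses.

```python
from math import comb

# Hypothetical example: two point hypotheses about a coin's probability of heads.
n, k = 10, 8                      # observed data D: 8 heads in 10 tosses

def likelihood(theta):
    """Binomial likelihood P(D | theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

p_D_given_H1 = likelihood(0.5)    # ≈ 0.0439
p_D_given_H2 = likelihood(0.7)    # ≈ 0.2335
prior_H1 = prior_H2 = 0.5         # equal prior probabilities

# Eq. (11) and Eq. (12): posterior probabilities via Bayes' theorem
evidence = p_D_given_H1 * prior_H1 + p_D_given_H2 * prior_H2
post_H1 = p_D_given_H1 * prior_H1 / evidence      # ≈ 0.158
post_H2 = p_D_given_H2 * prior_H2 / evidence      # ≈ 0.842

# Eq. (13): posterior odds = Bayes factor x prior odds
bayes_factor_12 = p_D_given_H1 / p_D_given_H2     # ≈ 0.19, evidence against H1
posterior_odds = post_H1 / post_H2                # equals the Bayes factor here, since the priors are equal

print(post_H1, post_H2, bayes_factor_12, posterior_odds)
```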
Note that in the simplest case, neither H1 nor H2 have any free parameters, and the
Bayes factor then corresponds to the likelihood ratio [7]. If, however, at least one of the
hypotheses (or models) has unknown parameters, then the conditional probabilities are
obtained by integrating over the entire parameter space of Hi [7]:
P(D|Hi) = ∫ P(D|θi, Hi) P(θi|Hi) dθi,    (14)

where θi denotes the parameters under Hi .
Note that Eq. 13 shows the Bayes factor B12 for only two hypotheses, but of course we
may also consider more than just two. In that case, we can write Bij to denote the Bayes
factor of Hi against Hj . When only two hypotheses are considered, they are often referred
to as null hypothesis, H0 , and alternative hypothesis, H1 . Jeffreys suggests grouping the
values of B01 into grades [8] (Table 1).
It is instructive to compare the Bayes factor with the p-value from Fisherian signif-
icance testing. In short, the p-value is defined as the probability of obtaining a result
as extreme as (or more extreme than) the actually observed result, given that the null
hypothesis is true. The p-value is generally considered an evidential weight against the
null hypothesis: the smaller the p-value, the greater the weight against H0 . However, the
p-value can be a highly misleading measure of evidence because it overstates the evidence
against H0 [3, 4, 5]. A Bayesian calibration of p-values is described in [6]. This calibration
leads to the Bayes factor bound, B̄ = −1/(e · p · log p), where p is the p-value.
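The Bayes factor bound is easy to evaluate; the short Python snippet below (an added illustration) computes B̄ for a few common p-values. For p = 0.01, the bound is roughly 8, which corresponds to the "8 to 1" odds mentioned further below.

```python
from math import e, log

def bayes_factor_bound(p):
    """Upper bound on the Bayes factor in favor of 'H0 is not true': -1 / (e * p * ln p)."""
    return -1.0 / (e * p * log(p))

for p in (0.05, 0.01, 0.005, 0.001):
    print(f"p = {p}: B-bar <= {bayes_factor_bound(p):.1f}")
# p = 0.05  -> about 2.5
# p = 0.01  -> about 8.0
# p = 0.005 -> about 13.9
# p = 0.001 -> about 53.3
```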

Table 1: Interpretation of Bayes factor B01 according to [8].

Grade B01 Interpretation


0 B01 > 1 Null hypothesis H0 supported
1 1 > B01 > 0.32 Evidence against H0, but not worth more than a bare mention
2 0.32 > B01 > 0.10 Evidence against H0 substantial
3 0.10 > B01 > 0.032 Evidence against H0 strong
4 0.032 > B01 > 0.01 Evidence against H0 very strong
5 0.01 > B01 Evidence against H0 decisive

Note that B̄ is an upper bound on the Bayes factor over any reasonable choice of the prior
distribution of the hypothesis “H0 is not true,” which we may refer to as the “alternative
hypothesis.” For example, a p-value of 0.01 corresponds to odds of, at most, about 8
to 1 in favor of “H0 is not true.”

So far, we have considered only the discrete case, i.e.,
when the sample space is countable. What if the variables are continuous? Let X and Y
denote two continuous random variables with joint probability density function fXY (x, y).
Let fX|Y (x|y) and fY |X (y|x) denote their conditional probability density functions. Then
fX|Y(x|y) = fXY(x, y) / fY(y)    (15)
and
fY|X(y|x) = fXY(x, y) / fX(x)    (16)
so that Bayes’ theorem for continuous variables can be stated as:
fX|Y(x|y) = fY|X(y|x) fX(x) / fY(y)    (17)
where fY(y) = ∫ fY|X(y|x) fX(x) dx = ∫ fXY(x, y) dx because of the total probability
theorem.
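To illustrate Eqs. (15) to (17) numerically, the following Python sketch (an added example with arbitrarily chosen Gaussian densities, not from the text) approximates fY(y) by a Riemann sum, which is the continuous analogue of the total probability theorem, and verifies that the conditional density fX|Y(x|y) obtained from Eq. (17) integrates to one.

```python
import math

def normal_pdf(z, mu=0.0, sigma=1.0):
    return math.exp(-(z - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Assumed toy model: X ~ N(0, 1) and Y | X = x ~ N(x, 1).
f_X = lambda x: normal_pdf(x, 0.0, 1.0)
f_Y_given_X = lambda y, x: normal_pdf(y, x, 1.0)

xs = [i * 0.01 for i in range(-800, 801)]     # integration grid for x
dx = 0.01

def f_Y(y):
    # marginal density via the total probability theorem, approximated by a Riemann sum
    return sum(f_Y_given_X(y, x) * f_X(x) * dx for x in xs)

y0 = 1.0
fy0 = f_Y(y0)
print(fy0)                                    # ≈ 0.2197
print(normal_pdf(y0, 0.0, math.sqrt(2.0)))    # closed form: Y ~ N(0, 2), also ≈ 0.2197

# Bayes' theorem for densities, Eq. (17), evaluated at y = y0
posterior_x = lambda x: f_Y_given_X(y0, x) * f_X(x) / fy0
print(sum(posterior_x(x) * dx for x in xs))   # ≈ 1.0, i.e., a valid conditional density
```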

In summary, Bayes’ theorem provides a logical method that combines new evidence
(i.e., new data, new observations) with prior probabilities of hypotheses in order to obtain
posterior probabilities of these hypotheses.

4.3 Naive Bayes Classifier
We assume that a data set contains n instances (or cases) xi, i = 1, . . . , n, which consist
of p attributes, i.e., xi = (xi1 , xi2 , . . . , xip ). Each instance is assumed to belong to one
(and only one) class y (y ∈ {y1 , y2 , . . . , yc }). Most predictive models in machine learning
generate a numeric score s for each instance xi . This score quantifies the degree of
class membership of that case in class yj . If the data set contains only positive and
negative instances (y ∈ {0, 1}), then a predictive model can often be used as a ranker
or as a classifier. The ranker uses the scores to order the instances from the most to
the least likely to be positive. By setting a threshold t on the ranking score, s(x), such
that s(x) ≥ t → 1, the ranker becomes a (crisp) classifier [?]. Naive Bayes learning
refers to the construction of a Bayesian probabilistic model that assigns a posterior class
probability to an instance: P (y = yj |X = xi ). The simple Naive Bayes classifier uses
these probabilities to assign an instance to a class. Applying Bayes’ theorem (Eq. 7), and
simplifying the notation a little, we obtain

P(yj|xi) = P(xi|yj) P(yj) / P(xi)    (18)

Note that the numerator in Eq. 18 is the joint probability of x and yj (cf. Eq. 6). The
numerator can therefore be rewritten as follows; here, we will just use x, omitting the
index i for simplicity:
P(x, yj) = P(x|yj) P(yj)
         = P(x1, x2, . . . , xp, yj)
         = P(x1|x2, . . . , xp, yj) P(x2, . . . , xp, yj)    because P(a, b) = P(a|b) P(b)
         = P(x1|x2, . . . , xp, yj) P(x2|x3, . . . , xp, yj) · · · P(xp|yj) P(yj)
Let us assume that the individual attributes are conditionally independent of each other,
given the class. This is a strong assumption, which is clearly violated in most practical
applications; it is precisely this naive assumption that gives the classifier its name. It implies that

P (x1 |x2 , . . . , xp , yj ) = P (x1 |yj ).

Thus, the joint probability of x and yj is


P(x|yj) P(yj) = P(yj) ∏_{k=1}^{p} P(xk|yj)    (19)

which we can plug into Eq. 18, and we obtain


P(yj|x) = P(yj) ∏_{k=1}^{p} P(xk|yj) / P(x)    (20)

Note that the denominator, P(x), does not depend on the class: it is the same for every
class yj. P(x) acts as a scaling factor and ensures that the posterior
probability P(yj|x) is properly scaled (i.e., a number between 0 and 1). When we
are interested in a crisp classification rule, that is, a rule that assigns each instance to
exactly one class, we can simply calculate the value of the numerator for each class
and select the class for which this value is maximal. This rule is called the maximum
posterior rule (Eq. 21). The resulting “winning” class is also known as the maximum a
posteriori (MAP) class, and it is calculated as ŷ for the instance x as follows:
ŷ = argmax_{yj} P(yj) ∏_{k=1}^{p} P(xk|yj)    (21)

A model that implements Eq. 21 is called a (simple) naive Bayes classifier. A crisp
classification, however, is often not desirable. For example, in ranking tasks involving a
positive and a negative class, we are often more interested in how well a model ranks
the cases of one class in relation to the cases of the other class [?]. The estimated class
posterior probabilities are natural ranking scores. Applying again the total probability
theorem (Eq. 3), we can rewrite Eq. 20 as
P(yj|x) = P(yj) ∏_{k=1}^{p} P(xk|yj) / [P(yj) ∏_{k=1}^{p} P(xk|yj) + P(y¬j) ∏_{k=1}^{p} P(xk|y¬j)]    (22)
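The calculations in Eqs. (18) to (22) translate directly into code. Below is a minimal, from-scratch Python sketch of a naive Bayes classifier for categorical attributes (an added illustration; the function and variable names are our own, and no smoothing is applied yet; see the Laplace smoothing example in Section 5.3).

```python
from collections import Counter, defaultdict

def fit_naive_bayes(X, y):
    """Estimate class priors P(y_j) and conditional probabilities P(x_k | y_j) by counting."""
    n = len(y)
    class_counts = Counter(y)
    priors = {c: class_counts[c] / n for c in class_counts}
    # cond[c][k][v] = P(attribute k takes value v | class c)
    cond = {c: defaultdict(dict) for c in class_counts}
    for c in class_counts:
        rows = [x for x, label in zip(X, y) if label == c]
        for k in range(len(X[0])):
            value_counts = Counter(row[k] for row in rows)
            for v, cnt in value_counts.items():
                cond[c][k][v] = cnt / class_counts[c]
    return priors, cond

def posterior(x, priors, cond):
    """Class posteriors P(y_j | x) according to Eq. (20); unseen attribute values get probability 0."""
    numerators = {}
    for c, prior in priors.items():
        num = prior
        for k, v in enumerate(x):
            num *= cond[c][k].get(v, 0.0)
        numerators[c] = num
    evidence = sum(numerators.values())            # P(x), the scaling factor
    return {c: num / evidence for c, num in numerators.items()} if evidence > 0 else numerators

def predict(x, priors, cond):
    """MAP rule, Eq. (21): pick the class with the largest numerator."""
    post = posterior(x, priors, cond)
    return max(post, key=post.get)
```

The counting estimates in fit_naive_bayes are exactly the empirical fractions used in the worked example of Section 5.2.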

5 Examples
5.1 Application of Bayes’ Theorem in Medical Screening
Consider a population of people in which 1% really have a disease, D. A medical screening
test is applied to 1000 randomly selected persons from that population. It is known that
the sensitivity of the test is 0.90, and the specificity of the test is 0.91.
(a) If a tested person is really sick, then what is the probability of a positive test result
(i.e., the result of the test indicates that the person is sick)?
(b) If the test is positive, then what is the probability that the person is really sick?
The probability that a randomly selected person has the disease is given as P (D) = 0.01
and thus P (Dc ) = 0.99. The test result depends on the sensitivity and specificity of the
test. (a): The sensitivity of a test is defined as
sensitivity = TP / (TP + FN),
where T P denotes the number of true positive predictions and F N denotes the number
of false negative predictions. Sensitivity is therefore the conditional probability that a
person is correctly identified as being sick; it is also called recall. Let ⊕ denote a positive
and ⊖ a negative test result, respectively.
The answer to (a) is therefore simple—in fact, it is already given: the conditional
probability P (⊕|D) is the same as the sensitivity, since the number of persons who are
really sick is the same as the number of true positive predictions (persons are sick and they
are correctly identified as such by the test) plus the number of false negative predictions
(persons are sick but they are not identified as such by the test). Thus,
P(⊕|D) = TP / (TP + FN) = 0.9.
To answer (b), we use Bayes’ theorem and obtain:
P(D|⊕) = P(⊕|D) P(D) / [P(⊕|D) P(D) + P(⊕|Dc) P(Dc)] = (0.9 · 0.01) / (0.9 · 0.01 + 0.09 · 0.99) ≈ 0.092.    (23)

The only unknown in Eq. 23 is P(⊕|Dc), which we can easily derive from the given
information: if the specificity is 0.91 (91%), then the false positive rate must be 0.09
(9%). But the false positive rate is the same as the conditional probability of a positive
result, given the absence of disease, i.e., P(⊕|Dc) = 0.09.
It can be insightful to represent the given information in a confusion matrix (Table 2).
Here, the number of true negatives and false positives are rounded to the nearest integer.
From the table, we can readily infer the chance of disease given a positive test result as
9/(9 + 89), i.e., just a bit more than 9%.

      D          Dc          Σ
⊕    TP = 9     FP = 89     98
⊖    FN = 1     TN = 901    902
Σ    10         990         1000

Table 2: Confusion table for the example on medical screening.

The conditional probability P (D|⊕) is also known as the positive predictive value in
epidemiology or as precision in data mining and related fields. What is the implication
of this probability being around 0.09?
The numbers in this example refer to health statistics for breast cancer screening
with mammography [?]. A positive predictive value of just over 9% means that only
about 1 out of every 10 women with a positive mammogram actually has breast cancer;
the remaining 9 persons are falsely alarmed. Gigerenzer et al. [?] showed that many
gynecologists do not know the probability that a person has a disease given a positive
test result, even when they are given appropriate health statistics framed as conditional
probabilities. By contrast, if the information is reframed in terms of natural frequencies
(as in 9/(9 + 89) in this example), then the information is often easier to understand.
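The following Python snippet (an added illustration) reproduces the screening calculation both ways: once with Bayes’ theorem (Eq. 23) and once with the natural frequencies from Table 2.

```python
prevalence = 0.01      # P(D)
sensitivity = 0.90     # P(+ | D)
specificity = 0.91     # P(- | not D), so the false positive rate is 1 - specificity

# Bayes' theorem, Eq. (23)
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_pos
print(round(ppv, 3))   # 0.092, the positive predictive value

# Natural frequencies for 1000 tested persons (cf. Table 2)
n = 1000
sick = n * prevalence                        # 10 persons
tp = sensitivity * sick                      # 9 true positives
fp = (1 - specificity) * (n - sick)          # about 89 false positives
print(round(tp / (tp + fp), 3))              # about 0.092 again: 9 out of roughly 98 positives are really sick
```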

5.2 Naive Bayes Classifier – Introductory Example


We illustrate naive Bayes learning using the contrived data set shown in Table 3. The
first 14 instances refer to biological samples that belong to either the class tumor or the
class normal. These samples represent the training set. Each instance is described by
an expression profile of only four genes. Here, the gene expression values are discretized
into either underexpressed (−1), overexpressed (+1), or normally expressed (0). Sample
#15 represents a new biological sample. What is the likely class of this sample? Note
that the particular combination of features, x15 = (+1, −1, +1, +1), does not appear in
the training set.
Using Eq. 20, we obtain:

P(tumor|x15) = P(A = +1|tumor) · P(B = −1|tumor) · P(C = +1|tumor) · P(D = +1|tumor) · P(tumor) / P(x15).

Let’s begin with the prior probability of “tumor,” P(tumor). This probability can be
estimated as the fraction of tumor samples in the data set, i.e., P(tumor) = 9/14.

Sample Gene A Gene B Gene C Gene D Class
1 +1 +1 +1 0 normal
2 +1 +1 +1 0 normal
3 0 +1 +1 0 tumor
4 +1 0 +1 0 tumor
5 +1 -1 0 0 tumor
6 0 -1 -1 0 normal
7 +1 -1 +1 +1 tumor
8 +1 -1 +1 0 normal
9 +1 -1 0 0 tumor
10 +1 +1 0 0 tumor
11 +1 +1 +1 +1 tumor
12 0 0 0 0 tumor
13 0 0 +1 0 tumor
14 -1 0 +1 +1 normal
15 +1 -1 +1 +1 unknown

Table 3: Contrived gene expression data set of 15 biological samples, each described by
the discrete expression level of 4 genes. A sample belongs either to class “normal” or
“tumor.” Instance #15 is a new, unclassified sample.

What is the fraction of samples for which gene A is overexpressed (+1), given that the
class is “tumor”? As an estimate for this conditional probability, P(Gene A = +1|tumor),
the empirical value of 2/9 (cf. samples #9 and #11) will be used.
Next, to calculate P(B = −1|tumor), we proceed as follows: among the nine tumor
samples, for how many do we observe B = −1? We observe B = −1 for cases #5, #7,
and #9, so the conditional probability is estimated as 3/9.
The remaining conditional probabilities are derived analogously. Thus, we obtain:
P(tumor|x15) = (2/9 · 3/9 · 3/9 · 3/9 · 9/14) / P(x15) = 0.00529 / P(x15),

P(normal|x15) = (3/5 · 1/5 · 4/5 · 3/5 · 5/14) / P(x15) = 0.02057 / P(x15).
With the denominator:

P (x15 ) = 0.00529 + 0.02057,

we then obtain the properly scaled probabilities:

P (tumor | x15 ) = 0.2046 and P (normal | x15 ) = 0.7954.
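The arithmetic above can be checked with a few lines of Python (added here for illustration; the conditional probabilities are the ones quoted in the text):

```python
# Conditional probabilities and priors as quoted in the text for x15 = (+1, -1, +1, +1)
num_tumor  = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.00529
num_normal = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.02057

evidence = num_tumor + num_normal                      # P(x15)
print(round(num_tumor / evidence, 4))                  # 0.2046
print(round(num_normal / evidence, 4))                 # 0.7954
```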

5.3 Laplace Smoothing


When the number of samples is small, a problem may arise over how to correctly estimate
the probability of an attribute given the class. Let us assume that at least one attribute
value of the test instance, x, is absent in all training instances of a class yi . For example,

assume that Gene A of instances #9 and #11 in Table 3 is underexpressed (−1) instead
of overexpressed (+1). Then we obtain the following conditional probabilities:

P(Gene A = +1 | tumor) = 0/9,
P(Gene A = 0 | tumor) = 4/9,
P(Gene A = −1 | tumor) = 5/9.
This obviously leads to P (x15 | tumor) = 0. If Gene A is underexpressed (−1) in
instances #9 and #11 in Table 3, then P (Gene A = +1 | tumor) = 0, which implies that
it is impossible to observe an overexpressed Gene A in a sample of class “tumor.” Is it
wise to make such a strong assumption? Probably not. It might be better to allow for
a small, non-zero probability. This is what Laplace smoothing does [?]. In this example,
we simply add 1 to each of the three numerators above and then add 3 to each of the
denominators:
P(Gene A = +1 | tumor) = (0 + 1)/(9 + 3),
P(Gene A = 0 | tumor) = (4 + 1)/(9 + 3),
P(Gene A = −1 | tumor) = (5 + 1)/(9 + 3).
However, instead of adding 1, we could also add a small positive constant c weighted
by pi:

P(Gene A = +1 | tumor) = (0 + c·p1)/(9 + c),
P(Gene A = 0 | tumor) = (4 + c·p2)/(9 + c),
P(Gene A = −1 | tumor) = (5 + c·p3)/(9 + c),
with p1 + p2 + p3 = 1, which are the prior probabilities for the states of expression for
Gene A. Although such a fully Bayesian specification is possible, in practice, it is often
unclear how the priors should be estimated, and simple Laplace smoothing is often appropriate [?].
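A minimal Python sketch of these smoothed estimates (an added illustration; the function names are our own, the first function generalizes the “add 1” rule, and the weighted variant uses assumed uniform priors p1 = p2 = p3 = 1/3):

```python
def laplace_estimate(count, class_total, n_values, k=1):
    """Smoothed estimate of P(attribute value | class): (count + k) / (class_total + k * n_values)."""
    return (count + k) / (class_total + k * n_values)

# 'Add 1' smoothing for the three states of Gene A among the 9 tumor samples
counts = {"+1": 0, "0": 4, "-1": 5}
smoothed = {v: laplace_estimate(c, class_total=9, n_values=3) for v, c in counts.items()}
print(smoothed)   # {'+1': 1/12 ≈ 0.083, '0': 5/12 ≈ 0.417, '-1': 6/12 = 0.5}

def weighted_estimate(count, class_total, prior, c=3):
    """Variant with a small constant c weighted by a prior probability for the attribute state."""
    return (count + c * prior) / (class_total + c)

# Hypothetical uniform priors give the same numbers as 'add 1' smoothing when c = 3
print({v: weighted_estimate(cnt, 9, 1/3) for v, cnt in counts.items()})
```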

5.4 Mixed Variables


In contrast to many other machine learning models, the naive Bayes classifier can easily
cope with mixed-variable data sets. For example, consider Table 4. Here, Gene B has
numeric expression values.
Assuming that the expression values of Gene B follow a normal distribution, we can
model the probability density for class yi as

f(x | yi) = (1 / (√(2π) σi)) · exp(−(x − µi)² / (2σi²)),    (24)
where µi and σi denote the mean and standard deviation of the gene expression values
for class yi, respectively.
Sample Gene A Gene B Gene C Gene D Class
1 +1 35 +1 0 normal
2 +1 30 +1 +1 normal
3 +1 24 0 0 tumor
4 -1 20 +1 0 tumor
5 -1 15 0 0 tumor
6 -1 15 0 +1 tumor
7 0 11 0 +1 tumor
8 0 12 0 0 normal
9 +1 14 0 0 tumor
10 +1 24 0 0 tumor
11 +1 30 +1 +1 tumor
12 +1 28 +1 +1 tumor
13 -1 23 0 0 tumor
14 -1 21 +1 +1 normal
15 +1 12 +1 +1 unknown

Table 4: Contrived gene expression data set from Table 3. Here, absolute expression
values are reported for Gene B.

Of course, in practice, other distributions are possible, and we need to choose the density
that best describes the data. In this example, we obtain
µtumor = 24, σtumor = 7.7, and µnormal = 24, σnormal = 8.5.
Note that the probability that a continuous random variable X takes on a particular
value is always zero for any continuous probability distribution, i.e., P(X = x) = 0.
However, using the probability density function, we can calculate the probability that X
lies in a narrow interval [x0 − ϵ/2, x0 + ϵ/2] around x0 as approximately ϵ · f(X = x0). For the new instance
x15 (Table 4), we obtain f(12 | tumor) = 0.02267 and f(12 | normal) = 0.01676, so that
we can state the conditional probabilities as:
P(tumor|x15) = (2/9 · 0.0227ϵ · 3/9 · 3/9 · 9/14) / P(x15) = 0.00036ϵ / P(x15)

and

P(normal|x15) = (3/5 · 0.01676ϵ · 4/5 · 3/5 · 5/14) / P(x15) = 0.00172ϵ / P(x15).
The posterior probabilities are:

P(tumor|x15) = 0.00036ϵ / (0.00036ϵ + 0.00172ϵ) = 0.17

and

P(normal|x15) = 0.00172ϵ / (0.00036ϵ + 0.00172ϵ) = 0.83.
Note that P (x15 ) cancels.
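The following Python sketch (an added illustration) implements the Gaussian density of Eq. (24) and recomputes the posteriors from the likelihood values and conditional probabilities quoted above; because the interval width ϵ cancels, it is simply omitted.

```python
import math

def normal_density(x, mu, sigma):
    """Gaussian probability density, Eq. (24). In practice, f(12 | class) would be obtained
    by evaluating this function with the class-specific mu and sigma estimated from the data."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Density values for Gene B = 12 as quoted in the text
f_tumor, f_normal = 0.02267, 0.01676

# Class numerators for x15 = (+1, 12, +1, +1); epsilon cancels and is left out
num_tumor  = (2/9) * f_tumor  * (3/9) * (3/9) * (9/14)
num_normal = (3/5) * f_normal * (4/5) * (3/5) * (5/14)

evidence = num_tumor + num_normal
print(round(num_tumor / evidence, 2))    # 0.17
print(round(num_normal / evidence, 2))   # 0.83
```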

5.5 Missing Value Imputation


Missing values do not present any problem for the naive Bayes classifier. Let us assume
that the new instance contains missing values (encoded as NA), for example, x15 =
(+1, NA, +1, +1). The posterior probability for class yi can then be calculated by simply
omitting this attribute, i.e.,
P(tumor|x15) = (2/9 · 3/9 · 3/9 · 9/14) / P(x15) = 0.016 / P(x15)

and

P(normal|x15) = (3/5 · 4/5 · 3/5 · 5/14) / P(x15) = 0.103 / P(x15).
If the training set has missing values, then the conditional probabilities can be calculated
by omitting these values. For example, suppose that the value +1 is missing
for Gene A in sample #1 (Table 3). What is the probability that Gene A is overexpressed
(+1), given that the sample is normal? There are five normal samples, and two
of them (#2 and #8, Table 3) have an overexpressed Gene A. Therefore, the conditional
probability is calculated as

P(Gene A = +1|normal) = 2/5.
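A short Python sketch (added for illustration) of how a missing attribute is simply skipped in the product of Eq. (20):

```python
# Conditional probabilities for x15 = (+1, NA, +1, +1) as quoted in the text; None marks the missing Gene B
cond_tumor  = [2/9, None, 3/9, 3/9]
cond_normal = [3/5, None, 4/5, 3/5]
prior_tumor, prior_normal = 9/14, 5/14

def numerator(cond_probs, prior):
    """Product over the observed attributes only; missing values (None) are omitted."""
    num = prior
    for p in cond_probs:
        if p is not None:
            num *= p
    return num

num_tumor = numerator(cond_tumor, prior_tumor)       # ≈ 0.016
num_normal = numerator(cond_normal, prior_normal)    # ≈ 0.103
evidence = num_tumor + num_normal
print(round(num_tumor / evidence, 2), round(num_normal / evidence, 2))   # ≈ 0.13 and 0.87
```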

6 Discussion

In this article, we derived Bayes’ theorem from the fundamental concepts of probability.
We then presented one member of the family of machine learning methods that are
based on this theorem, the naive Bayes classifier, which is one of the oldest workhorses
of machine learning.

It is well known that the misclassification error rate is minimized if each instance
is classified as a member of that class for which its conditional class posterior
probability is maximal. Consequently, the naive Bayes classifier is optimal (cf. Eq. 21) in
the sense that no other classifier is expected to achieve a smaller misclassification error
rate, provided that the features are indeed independent. However, this assumption is a rather
strong one; clearly, in the vast majority of real-world classification problems, it is violated.
This is particularly true for genomic data sets with many co-expressed
genes. Perhaps surprisingly, however, the naive Bayes classifier has demonstrated excellent
performance even when the data set attributes are not independent [15, 16].

The performance of the naive Bayes classifier can often be improved by eliminating
highly correlated features. For example, assume that we add ten additional genes
to the data set shown in Table 4, where each gene is described by expression values
that are highly correlated to those of Gene B. This means that the estimated conditional
probabilities will be dominated by those values, which would “swamp out” the information
contained in the remaining genes.

7 Closing Remarks

Harold Jeffreys, a pioneer of modern statistics, succinctly stated the importance
of Bayes’ theorem: “[Bayes’ theorem] is to the theory of probability what Pythagoras’
theorem is to geometry.” Indeed, Bayes’ theorem is of fundamental importance not only
for inferential statistics, but also for machine learning, as it underpins the naive Bayes
classifier. This classifier has demonstrated excellent performance compared to more sophisticated models
in a range of applications, including tumor classification based on gene expression profiling
[19]. The naive Bayes classifier performs remarkably well even when the underlying
independence assumption is violated.

8 K-Nearest Neighbour Classification


The nearest neighbour classification algorithm is one of the simplest of all machine learning
algorithms. It is based on the principle that similar samples generally lie in
close vicinity of one another. K-nearest neighbour (K-NN) is an instance-based learning method. Instance-based
classifiers are also called lazy learners because they store all of the training samples and do
not build a classifier until a new, unlabeled sample needs to be classified. Lazy-learning
algorithms require less computation time during the training phase than eager-learning
algorithms (such as decision trees, neural networks, and Bayesian networks), but more
computation time during the classification process.

Nearest-neighbour classification is based on learning by resemblance, i.e., by comparing
a given test sample with the training samples that are similar to it. For
a data sample X to be classified, its K nearest neighbours are searched, and X is then
assigned to the class label to which the majority of its neighbours belong. The choice of K also
affects the performance of the K-nearest neighbour algorithm. If the value of K is too small,
the K-NN classifier may be vulnerable to overfitting because of noise present in the
training dataset. On the other hand, if K is too large, the nearest-neighbour classifier
may misclassify the test sample because its list of nearest neighbours may contain some
data points that are located far away from its neighbourhood.

K-NN fundamentally relies on the assumption that the data lie in a feature space
in which distances are meaningful. Hence, the distances between the unknown sample and all training
points are computed; Euclidean distance or Hamming distance is used, depending on the data type of
the attributes. A single value of K is given, which specifies the number
of nearest neighbours that determine the class label for the unknown sample. If K = 1,
the method is called nearest neighbour classification.
The K-NN classifier works as follows (a minimal code sketch is given after this list):

• Initialize the value of K.

• Calculate the distance between the input sample and all training samples.

• Sort the distances.

• Take the top K nearest neighbours.

• Apply a simple majority vote.

• Predict the class label of the input sample as the most frequent class among these neighbours.
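A minimal Python sketch of these steps (added for illustration; it uses Euclidean distance and a simple majority vote, with function and variable names of our own choosing):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by a majority vote among its k nearest training samples (Euclidean distance)."""
    # Steps 1-2: compute the distance from x to every training sample
    distances = [(math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)]
    # Steps 3-4: sort the distances and keep the top k neighbours
    neighbours = sorted(distances, key=lambda d: d[0])[:k]
    # Steps 5-6: simple majority vote over the neighbours' class labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Tiny hypothetical example
train_X = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (4.1, 3.9)]
train_y = ["a", "a", "b", "b"]
print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))   # 'a'
```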

Advantages:

• Easy to understand and implement.

• Training is very fast.
Figure 2: An example of the K-NN classifier.

• It is robust to noisy training data.

• It performs well on applications in which a sample can have many labels.

Disadvantages:

• Lazy learners incur expensive computational costs when the number of potential
neighbours with which to compare a given unlabeled sample is large.

• It is sensitive to the local structure of the data.

• Memory limitation, as all training samples must be stored.

• As it is a supervised lazy learner, it runs slowly at classification time.

9 Decision Tree Induction


Decision tree learning uses a decision tree as a predictive model which maps observations
about an item to conclusions about the item’s target value. A decision tree algorithm
is a data mining induction technique that recursively partitions a dataset of records using
a depth-first greedy approach or a breadth-first approach until all the data items belong
to a particular class.

A decision tree structure is made of root, internal, and leaf nodes. It is a flow-chart-like
tree structure, where every internal node denotes a test condition on an attribute,
each branch represents a result of the test condition, and each leaf node is assigned a
class label. The topmost node is the root node. The decision tree is constructed in a divide-and-conquer
approach. Each path in the decision tree forms a decision rule. Generally, it
uses a greedy approach from top to bottom.

Decision tree classification is performed in two phases: tree building
and tree pruning. Tree building is performed in a top-down manner. During this phase,
the tree is recursively partitioned until all the data items belong to the same class label.
It is computationally intensive because the training dataset is traversed repeatedly. Tree
pruning is done in a bottom-up manner. It is used to improve the prediction and classification
accuracy of the algorithm by minimizing overfitting of the tree. Overfitting
in decision trees results in misclassification error.

There are many decision-tree-based algorithms, such as ID3, C4.5, C5.0, and CART.
These algorithms have the merits of high classification speed, strong learning ability, and
simple construction. A decision tree can be explained with the example depicted below.

Figure 3: An example of decision tree induction.

The example shows a weather forecasting process which deals with predicting whether the weather is
sunny, overcast, or rainy, and the amount of humidity if it is sunny. This tree model can
be applied to determine whether the conditions are suitable for playing a match or not.
Thus, a person can easily assess the present weather and, based on that, decide
whether a match is possible.
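To make the split-selection step concrete, the following Python sketch (an added illustration using a small hypothetical weather-like data set, not the one in the figure) computes the entropy of the class labels and the information gain of one attribute; ID3-style algorithms choose the attribute with the highest gain as the next test node.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy reduction achieved by splitting the labels on the given attribute."""
    n = len(labels)
    split_entropy = 0.0
    for value in set(attribute_values):
        subset = [lab for val, lab in zip(attribute_values, labels) if val == value]
        split_entropy += (len(subset) / n) * entropy(subset)
    return entropy(labels) - split_entropy

# Hypothetical data: an outlook attribute and a play / don't-play class label
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast", "sunny", "rainy"]
play    = ["no",    "no",    "yes",      "yes",   "no",    "yes",      "yes",   "yes"]

print(round(entropy(play), 3))                      # ≈ 0.954
print(round(information_gain(outlook, play), 3))    # ≈ 0.266, the gain from splitting on outlook
```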
Advantages:
• Decision trees are very simple and fast.
• They produce accurate results.
• The representation is easy to understand, i.e., comprehensible.
• They support incremental learning.
• They require little memory.
• They can also deal with noisy data.
• They use different measures, such as entropy, Gini index, and information gain, to find
the best split attribute.
Disadvantages:
• Training time can be long.
• A decision tree can have a significantly more complex representation for some concepts
due to the replication problem.
• It is prone to overfitting.

10 Comparison among K-NN, Naive Bayes, and Decision Tree Techniques

The following table shows a comparison between the K-NN, naive Bayes, and decision tree techniques.

Parameter | K-NN | Naive Bayes | Decision Tree
Deterministic / Non-deterministic | Non-deterministic | Non-det… | …
Effectiveness on | Small data | Huge… | …
Speed | Slower for large data | Faster th… | …
Dataset | It can’t deal with noisy data | It can deal w… | …
Accuracy | Provides high accuracy | For obtaining good results it requ… | …

11 Results

The following tables summarize the results of implementing the different classifiers
using the WEKA tool. Table 5 shows the classification accuracy of each classifier, and
Table 6 shows the time taken by each classifier to classify the given datasets.

Table 5: Results of Accuracy of Classifiers

Dataset | Size of Dataset | KNN | Naïve Bayes | Decision Tree


Weather Nominal Small (14 instances) 100% 92.857% 100%
Segment Challenge Medium (1500 instances) 100% 81.667% 99%
Supermarket Large (4627 instances) 89.842% 63.713% 63.713%

Table 6: Results of Time Taken for Classification

Dataset | Size of Dataset | Time | KNN | Naive Bayes | Decision Tree
Weather Nominal | Small (14 instances) | To Build Model | 0 sec | 0 sec | 0.0…
 | | To Test Model | 0.02 sec | 0 sec | 0…
Segment Challenge | Medium (1500 instances) | To Build Model | 0 sec | 0.08 sec | 0.1…
 | | To Test Model | 0.42 sec | 0.31 sec | 0.0…
Super Market | Large (4627 instances) | To Build Model | 0.02 sec | 0.06 sec | 0.0…
 | | To Test Model | 45.55 sec | 0.28 sec | 0.0…
