2 Fundamentals
2.1 Basic Notation and Concepts
A statistical experiment can be broadly defined as a process that results in one
and only one of several possible outcomes. The collection of all possible outcomes is
called the sample space, denoted by Ω. At the introductory level, we can describe events
by using notation from set theory. For example, our experiment may be one roll of a fair
die. The sample space is then Ω = {1, 2, 3, 4, 5, 6}, which is also referred to as universal
set in the terminology of set theory. A simple event is, for instance, the outcome 2, which
we denote as E1 = {2}. The probability of an event is denoted by P (·). According to the
classic concept of probability, the probability of an event E is the number of outcomes
that are favorable to this event, divided by the total number of possible outcomes for the
experiment:
P(E) = |E| / |Ω|,
where |E| denotes the cardinality of the set E, i.e., the number of elements in E. In our
example, the probability of rolling a 2 is
P(E1) = |E1| / |Ω| = 1/6.
The event “the number is even” is a compound event, denoted by E2 = {2, 4, 6}. The
cardinality of E2 is 3, so the probability of this event is
P(E2) = 3/6.
The complement of E is the event that E does not occur and is denoted by E c , with
P (E c ) = 1 − P (E). In the example, E2c = {1, 3, 5}. Furthermore, P (A|B) denotes the
conditional probability of A given B. Finally, ∅ denotes the empty set, i.e., ∅ = {}.
Let A and B be two events from a sample space Ω, which is either finite with N
elements or countably infinite. Let P : Ω → [0, 1] be a probability distribution on Ω,
such that 0 < P (A) < 1 and 0 < P (B) < 1 and, obviously, P (Ω) = 1. We can represent
these events in a Venn diagram (Fig. 1a). The union of the events A and B, denoted by
A ∪ B, is the event that either A or B or both occur. The intersection of the events A
and B, denoted by A ∩ B, is the event that both A and B occur. Finally, two events, A
and B, are called mutually exclusive if the occurrence of one of these events rules out the
possibility of occurrence of the other event. In the notation of set theory, this means that
A and B are disjoint, i.e., A∩B = ∅. Two events A and B, with P (A) > 0 and P (B) > 0,
are called independent if the occurrence of one event does not affect the probability of
occurrence of the other event, i.e.,
P (A ∩ B) = P (A) · P (B).
Note that the conditional probability, P (A|B), is the joint probability P (A∩B) divided
by the marginal probability P (B). This is a fundamental relation, which has a simple
geometrical interpretation. Loosely speaking, given that we are in the ellipse B (Fig.
1a), what is the probability that we are also in A? To also be in A, we have to be in the
intersection A ∩ B. Hence, the probability is the number of elements in the intersection,
|A ∩ B|, divided by the number of elements in B, i.e., |B|. Formally,
P(A|B) = (|A ∩ B| / |Ω|) / (|B| / |Ω|) = P(A ∩ B) / P(B).
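To make these counting definitions concrete, here is a minimal Python sketch for the die example above; the function and variable names are chosen only for illustration.

from fractions import Fraction

# Sample space for one roll of a fair die
omega = {1, 2, 3, 4, 5, 6}

def prob(event, omega=omega):
    """Classical probability: |E| / |Omega|."""
    return Fraction(len(event), len(omega))

E1 = {2}          # simple event: rolling a 2
E2 = {2, 4, 6}    # compound event: the number is even

print(prob(E1))        # 1/6
print(prob(E2))        # 1/2
print(1 - prob(E2))    # P(E2 complement) = 1/2

def cond_prob(A, B, omega=omega):
    """P(A|B) = |A intersection B| / |B|, assuming |B| > 0."""
    return Fraction(len(A & B), len(B))

# Given that the roll is even (B = E2), what is the probability that it is a 2?
print(cond_prob(E1, E2))   # 1/3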
Figure 1: (a) Venn diagram for sets A and B. (b) Illustration of the total probability theorem. The sample space Ω is divided into five disjoint sets A1 to A5, which partly overlap with set B.
We assume that the sample space can be divided into n mutually exclusive events Ai, i = 1, . . . , n, as shown in Figure 1(b). Specifically,
1. A1 ∪ A2 ∪ A3 ∪ · · · ∪ An = Ω
2. Ai ∩ Aj = ∅ for i ≠ j
3. Ai ≠ ∅
For any event B, the total probability theorem then states that P(B) = Σ_{i=1}^{n} P(B ∩ Ai) = Σ_{i=1}^{n} P(B|Ai) P(Ai).
4 Bayes’ Theorem
Assuming that |A| ≠ 0 and |B| ≠ 0, we can state the following:

P(A|B) = |A ∩ B| / |B| = (|A ∩ B| / |Ω|) / (|B| / |Ω|) = P(A ∩ B) / P(B)    (4)

P(B|A) = |B ∩ A| / |A| = (|B ∩ A| / |Ω|) / (|A| / |Ω|) = P(A ∩ B) / P(A)    (5)
and therefore:

P(A|B) = P(B|A) P(A) / P(B)    (7)
which is the simplest (and perhaps the most memorable) formulation of Bayes’ theorem.
More generally, for the partition A1, . . . , An of Ω introduced above, Bayes' theorem can be written as:

P(Ai | B) = P(B | Ai) P(Ai) / Σ_{j=1}^{n} P(B | Aj) P(Aj)    (8)
This can also be rewritten as:

P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|Ac) P(Ac)]    (9)
Both Equations (8) and (9) follow from Equation (7) because of the total probability theorem.
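As a minimal sketch, Eq. 8 can be computed directly once the priors P(Ai) and the likelihoods P(B|Ai) are available; the numbers below are illustrative and not taken from the text.

def posterior_over_partition(priors, likelihoods):
    """Bayes' theorem for a partition A1..An of the sample space (Eq. 8).

    priors:      [P(A1), ..., P(An)], must sum to 1
    likelihoods: [P(B|A1), ..., P(B|An)]
    returns:     [P(A1|B), ..., P(An|B)]
    """
    # Total probability theorem: P(B) = sum_j P(B|Aj) P(Aj)
    p_b = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / p_b for l, p in zip(likelihoods, priors)]

# Hypothetical example: three disjoint events with the likelihoods given below
print(posterior_over_partition([0.5, 0.3, 0.2], [0.1, 0.4, 0.8]))
# -> [0.151..., 0.363..., 0.484...]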
In statistical inference, Bayes' theorem is commonly stated in terms of a hypothesis and observed data:

P(hypothesis | data) = P(data | hypothesis) P(hypothesis) / P(data)    (10)

Here, P(data) is the probability of observing the data, irrespective of the specified hypothesis.
The prior probability quantifies the a priori plausibility of the hypothesis. Often, the
data can arise under two competing hypotheses, H1 and H2 , with P (H1 ) = 1 − P (H2 ).
Let D denote the observed data. Then:
P(H1 | D) = P(D | H1) P(H1) / [P(D | H1) P(H1) + P(D | H2) P(H2)]    (11)
and:
P(H2 | D) = P(D | H2) P(H2) / [P(D | H1) P(H1) + P(D | H2) P(H2)]    (12)
From these, the posterior odds are:
P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] · [P(H1) / P(H2)]    (13)
Here, P(D | H1) / P(D | H2) is the Bayes factor, which quantifies the evidence that the data provide in favor of H1 against H2. Equivalently, the Bayes factor is the ratio of the posterior odds of H1 to its prior odds.
If the prior probability of H1 is the same as that of H2 (i.e., P (H1 ) = P (H2 ) = 0.5), then
the Bayes factor is the same as the posterior odds.
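A minimal sketch of Eq. 13, assuming the likelihoods and priors are given as plain numbers; the values are illustrative only.

def posterior_odds(p_d_h1, p_d_h2, p_h1, p_h2):
    """Posterior odds of H1 against H2 (Eq. 13): Bayes factor times prior odds."""
    bayes_factor = p_d_h1 / p_d_h2
    prior_odds = p_h1 / p_h2
    return bayes_factor * prior_odds, bayes_factor

# Hypothetical likelihoods; equal priors, so the posterior odds equal the Bayes factor
odds, bf = posterior_odds(p_d_h1=0.08, p_d_h2=0.02, p_h1=0.5, p_h2=0.5)
print(odds, bf)   # 4.0 4.0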
Note that in the simplest case, neither H1 nor H2 have any free parameters, and the
Bayes factor then corresponds to the likelihood ratio [7]. If, however, at least one of the
hypotheses (or models) has unknown parameters, then the conditional probabilities are
obtained by integrating over the entire parameter space of Hi [7]:
P(D | Hi) = ∫ P(D | θi, Hi) P(θi | Hi) dθi,    (14)
where θi denotes the parameters under Hi .
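As a hedged illustration of Eq. 14, the sketch below approximates the marginal likelihood numerically for an assumed binomial model with a uniform prior on its parameter; for this particular choice the integral has the known closed form 1/(n + 1), which serves as a check.

import numpy as np
from math import comb

# Assumed model: k successes in n Bernoulli trials with unknown success probability
# theta, and a uniform prior P(theta|H) on [0, 1] (an illustrative choice).
n, k = 10, 7

theta = np.linspace(0.0, 1.0, 10001)
likelihood = comb(n, k) * theta**k * (1.0 - theta)**(n - k)   # P(D | theta, H)
prior = np.ones_like(theta)                                   # P(theta | H), uniform

# Eq. 14: P(D|H) = integral of P(D|theta,H) P(theta|H) dtheta (simple Riemann sum)
dtheta = theta[1] - theta[0]
p_d_given_h = np.sum(likelihood * prior) * dtheta

print(p_d_given_h)      # ~0.0909
print(1.0 / (n + 1))    # exact value for a uniform prior: 1/(n+1)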
Note that Eq. 13 shows the Bayes factor B12 for only two hypotheses, but of course we
may also consider more than just two. In that case, we can write Bij to denote the Bayes
factor of Hi against Hj . When only two hypotheses are considered, they are often referred
to as null hypothesis, H0 , and alternative hypothesis, H1 . Jeffreys suggests grouping the
values of B10 into grades [8] (Table 1).
It is instructive to compare the Bayes factor with the p-value from Fisherian signif-
icance testing. In short, the p-value is defined as the probability of obtaining a result
as extreme as (or more extreme than) the actually observed result, given that the null
hypothesis is true. The p-value is generally considered an evidential weight against the
null hypothesis: the smaller the p-value, the greater the weight against H0 . However, the
p-value can be a highly misleading measure of evidence because it overstates the evidence
against H0 [3, 4, 5]. A Bayesian calibration of p-values is described in [6]. This calibration
leads to the Bayes factor bound, B̄ = −1 / (e · p · log p), where p is the p-value. Note that B̄ is an upper bound on the Bayes factor over any reasonable choice of the prior distribution of the hypothesis "H0 is not true," which we may refer to as the alternative hypothesis. For example, a p-value of 0.01 corresponds to odds of, at most, about 8 to 1 in favor of "H0 is not true."
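The bound is a one-line computation, shown here as a small sketch; the second p-value is included only for comparison.

from math import e, log

def bayes_factor_bound(p):
    """Upper bound on the Bayes factor against H0: B = -1 / (e * p * ln p), for p < 1/e."""
    return -1.0 / (e * p * log(p))

print(bayes_factor_bound(0.01))   # ~7.99, i.e., odds of at most about 8 to 1
print(bayes_factor_bound(0.05))   # ~2.46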
So far, we have considered only the discrete case, i.e., when the sample space is countable. What if the variables are continuous? Let X and Y
denote two continuous random variables with joint probability density function fXY (x, y).
Let fX|Y (x|y) and fY |X (y|x) denote their conditional probability density functions. Then
fX|Y(x | y) = fXY(x, y) / fY(y)    (15)
and
fY|X(y | x) = fXY(x, y) / fX(x)    (16)
so that Bayes’ theorem for continuous variables can be stated as:
fX|Y(x | y) = fY|X(y | x) fX(x) / fY(y)    (17)
where fY(y) = ∫ fY|X(y | x) fX(x) dx = ∫ fXY(x, y) dx because of the total probability theorem.
In summary, Bayes’ theorem provides a logical method that combines new evidence
(i.e., new data, new observations) with prior probabilities of hypotheses in order to obtain
posterior probabilities of these hypotheses.
4.3 Naive Bayes Classifier
We assume that a data set contains n instances (or cases) xi, i = 1, . . . , n, which consist
of p attributes, i.e., xi = (xi1 , xi2 , . . . , xip ). Each instance is assumed to belong to one
(and only one) class y (y ∈ {y1 , y2 , . . . , yc }). Most predictive models in machine learning
generate a numeric score s for each instance xi . This score quantifies the degree of
class membership of that case in class yj . If the data set contains only positive and
negative instances (y ∈ {0, 1}), then a predictive model can often be used as a ranker
or as a classifier. The ranker uses the scores to order the instances from the most to
the least likely to be positive. By setting a threshold t on the ranking score, s(x), such that instances with s(x) ≥ t are assigned to the positive class, the ranker becomes a (crisp) classifier [?]. Naive Bayes learning
refers to the construction of a Bayesian probabilistic model that assigns a posterior class
probability to an instance: P (y = yj |X = xi ). The simple Naive Bayes classifier uses
these probabilities to assign an instance to a class. Applying Bayes' theorem (Eq. 7), and simplifying the notation a little, we obtain

P(yj | xi) = P(xi | yj) P(yj) / P(xi)    (18)
Note that the numerator in Eq. 18 is the joint probability of x and yj (cf. Eq. 5). The
numerator can therefore be rewritten as follows; here, we will just use x, omitting the
index i for simplicity:
P (x, yj ) = P (x|yj )P (yj )
= P (x1 , x2 , . . . , xp , yj )
= P (x1 |x2 , . . . , xp , yj )P (x2 , . . . , xp , yj ) because P (a, b) = P (a|b)P (b)
= P (x1 |x2 , . . . , xp , yj )P (x2 |x3 , . . . , xp , yj ) . . . P (xp |yj )P (yj )
Let us assume that the individual attributes x1, . . . , xp are conditionally independent of each other, given the class. This is a strong assumption, which is clearly violated in most practical applications, and it is therefore naive, hence the name. This assumption implies that

P(x | yj) = P(x1 | yj) P(x2 | yj) · · · P(xp | yj) = ∏_{k=1}^{p} P(xk | yj)    (19)

so that the class posterior probability (Eq. 18) can be written as

P(yj | x) = ∏_{k=1}^{p} P(xk | yj) P(yj) / P(x)    (20)
Note that the denominator, P(x), does not depend on the class; it is the same for every class yj. P(x) acts as a scaling factor that ensures that the posterior probability P(yj | x) is properly scaled, i.e., that it is a number between 0 and 1. When we
are interested in a crisp classification rule, that is, a rule that assigns each instance to
exactly one class, then we can simply calculate the value of the numerator for each class
and select that class for which this value is maximal. This rule is called the maximum
posterior rule (Eq. 21). The resulting “winning” class is also known as the maximum a
posteriori (MAP) class, and it is calculated as ŷ for the instance x as follows:
ŷ = argmax_{yj} ∏_{k=1}^{p} P(xk | yj) P(yj)    (21)
A model that implements Eq. 21 is called a (simple) naive Bayes classifier. A crisp
classification, however, is often not desirable. For example, in ranking tasks involving a
positive and a negative class, we are often more interested in how well a model ranks
the cases of one class in relation to the cases of the other class [?]. The estimated class
posterior probabilities are natural ranking scores. Applying again the total probability
theorem, we can rewrite Eq. 20 as
P(yj | x) = ∏_{k=1}^{p} P(xk | yj) P(yj) / [ ∏_{k=1}^{p} P(xk | yj) P(yj) + ∏_{k=1}^{p} P(xk | y¬j) P(y¬j) ]    (22)
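The following sketch implements Eqs. 21 and 22 for categorical attributes by simple counting, without smoothing; the toy data at the bottom are illustrative and are not the data set of Table 3.

from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate class priors P(yj) and conditional probabilities P(xk = v | yj)
    from categorical training data by simple counting (no smoothing)."""
    n = len(y)
    class_counts = Counter(y)
    priors = {c: class_counts[c] / n for c in class_counts}
    # cond[c][k][v] = count of attribute k taking value v within class c
    cond = {c: defaultdict(Counter) for c in class_counts}
    for xi, c in zip(X, y):
        for k, v in enumerate(xi):
            cond[c][k][v] += 1
    probs = {c: {k: {v: cnt / class_counts[c] for v, cnt in vals.items()}
                 for k, vals in cond[c].items()}
             for c in class_counts}
    return priors, probs

def predict(x, priors, probs):
    """Maximum a posteriori class (Eq. 21) and normalized posteriors (Eq. 22)."""
    scores = {}
    for c in priors:
        score = priors[c]
        for k, v in enumerate(x):
            score *= probs[c][k].get(v, 0.0)   # unseen value -> probability 0
        scores[c] = score
    total = sum(scores.values())               # P(x); cancels in the argmax
    posteriors = {c: s / total for c, s in scores.items()} if total > 0 else scores
    return max(scores, key=scores.get), posteriors

# Illustrative toy data: two categorical attributes, two classes
X = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "x"), ("a", "x")]
y = ["pos", "pos", "neg", "neg", "pos"]
priors, probs = train_naive_bayes(X, y)
print(predict(("a", "x"), priors, probs))   # ('pos', {'pos': 1.0, 'neg': 0.0})

Note that an attribute value that never occurs for a class yields a probability of zero, which is exactly the problem addressed by Laplace smoothing later in the text.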
3. Examples
3.1 Application of Bayes’ Theorem in Medical Screening
Consider a population of people in which 1% really have a disease, D. A medical screening
test is applied to 1000 randomly selected persons from that population. It is known that
the sensitivity of the test is 0.90, and the specificity of the test is 0.91.
(a) If a tested person is really sick, then what is the probability of a positive test result
(i.e., the result of the test indicates that the person is sick)?
(b) If the test is positive, then what is the probability that the person is really sick?
The probability that a randomly selected person has the disease is given as P (D) = 0.01
and thus P (Dc ) = 0.99. The test result depends on the sensitivity and specificity of the
test. (a): The sensitivity of a test is defined as
sensitivity = TP / (TP + FN),
where T P denotes the number of true positive predictions and F N denotes the number
of false negative predictions. Sensitivity is therefore the conditional probability that a
person is correctly identified as being sick; it is also called recall. Let ⊕ denote a positive
and ⊖ a negative test result, respectively.
The answer to (a) is therefore simple—in fact, it is already given: the conditional
probability P (⊕|D) is the same as the sensitivity, since the number of persons who are
really sick is the same as the number of true positive predictions (persons are sick and they
are correctly identified as such by the test) plus the number of false negative predictions
(persons are sick but they are not identified as such by the test). Thus,
P(⊕ | D) = TP / (TP + FN) = 0.9.
To answer (b), we use Bayes’ theorem and obtain:
P(D | ⊕) = P(⊕ | D) P(D) / [P(⊕ | D) P(D) + P(⊕ | Dc) P(Dc)] = (0.9 · 0.01) / (0.9 · 0.01 + 0.09 · 0.99) ≈ 0.092.    (23)
The only unknown in Eq. 23 is P(⊕|Dc), which we can easily derive from the given
information: if the specificity is 0.91 or 91%, then the false positive rate must be 0.09 or
9%. But the false positive rate is the same as the conditional probability of a positive
result, given the absence of disease, i.e., P (⊕|Dc ) = 0.09.
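The screening example can be reproduced in a few lines; the variable names are chosen for illustration.

# Medical screening example: P(D) = 0.01, sensitivity = 0.90, specificity = 0.91
p_d = 0.01
sensitivity = 0.90          # P(+|D)
specificity = 0.91          # P(-|not D), so P(+|not D) = 1 - specificity

p_pos = sensitivity * p_d + (1 - specificity) * (1 - p_d)   # total probability
p_d_given_pos = sensitivity * p_d / p_pos                   # Bayes' theorem (Eq. 23)

print(round(p_d_given_pos, 3))   # 0.092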
It can be insightful to represent the given information in a confusion matrix (Table 2).
Here, the number of true negatives and false positives are rounded to the nearest integer.
From the table, we can readily infer the chance of disease given a positive test result as 9/(9 + 89), i.e., just a bit more than 9%.
D Dc Σ
⊕ TP = 9 FP = 89 98
⊖ FN = 1 TN = 901 902
Σ 10 990 1000
Table 2: Confusion matrix for the medical screening example (1000 tested persons).
The conditional probability P (D|⊕) is also known as the positive predictive value in
epidemiology or as precision in data mining and related fields. What is the implication
of this probability being around 0.09?
The numbers in this example refer to health statistics for breast cancer screening
with mammography [?]. A positive predictive value of just over 9% means that only
about 1 out of every 10 women with a positive mammogram actually has breast cancer;
the remaining 9 persons are falsely alarmed. Gigerenzer et al. [?] showed that many
gynecologists do not know the probability that a person has a disease given a positive
test result, even when they are given appropriate health statistics framed as conditional
probabilities. By contrast, if the information is reframed in terms of natural frequencies
(as in 9/(9 + 89) in this example), then the information is often easier to understand.
We now apply the naive Bayes classifier to the contrived gene expression data set in Table 3, with the goal of classifying the new instance #15. Let's begin with the prior probability of "tumor," P(tumor). This probability can be estimated as the fraction of tumor samples in the data set, i.e., P(tumor) = 9/14.
Sample Gene A Gene B Gene C Gene D Class
1 +1 +1 +1 0 normal
2 +1 +1 +1 0 normal
3 0 +1 +1 0 tumor
4 +1 0 +1 0 tumor
5 +1 -1 0 0 tumor
6 0 -1 -1 0 normal
7 +1 -1 +1 +1 tumor
8 +1 -1 +1 0 normal
9 +1 -1 0 0 tumor
10 +1 +1 0 0 tumor
11 +1 +1 +1 +1 tumor
12 0 0 0 0 tumor
13 0 0 +1 0 tumor
14 -1 0 +1 +1 normal
15 +1 -1 +1 +1 unknown
Table 3: Contrived gene expression data set of 15 biological samples, each described by
the discrete expression level of 4 genes. A sample belongs either to class “normal” or
“tumor.” Instance #15 is a new, unclassified sample.
What is the fraction of samples for which gene A is overexpressed (+1), given that the
class is “tumor”? As an estimate for this conditional probability, P (Gene A = +1|tumor),
the empirical value of 2/9 (cf. samples #9 and #11) will be used.
Next, to calculate P (B = −1|tumor), we proceed as follows: among the nine tumor
samples, for how many do we observe B = −1? We observe B = −1 for cases #5, #7,
and #9, so the conditional probability is estimated as 3/9.
The remaining conditional probabilities are derived analogously. Thus, we obtain:
P(tumor | x15) = (2/9 · 3/9 · 3/9 · 3/9 · 9/14) / P(x15) = 0.00529 / P(x15)
P(normal | x15) = (3/5 · 1/5 · 4/5 · 3/5 · 5/14) / P(x15) = 0.02057 / P(x15)
With the denominator P(x15) = 0.00529 + 0.02057 = 0.02586, which follows from the total probability theorem, the posterior probabilities are P(tumor | x15) = 0.00529 / 0.02586 ≈ 0.20 and P(normal | x15) = 0.02057 / 0.02586 ≈ 0.80, so the new instance #15 is assigned to the class "normal."
A problem arises when an empirical conditional probability is zero. For example, assume that Gene A of instances #9 and #11 in Table 3 is underexpressed (−1) instead of overexpressed (+1). Then we obtain the following conditional probabilities:
P(Gene A = +1 | tumor) = 0/9
P(Gene A = 0 | tumor) = 4/9
P(Gene A = −1 | tumor) = 5/9
This obviously leads to P (x15 | tumor) = 0. If Gene A is underexpressed (−1) in
instances #9 and #11 in Table 3, then P (Gene A = +1 | tumor) = 0, which implies that
it is impossible to observe an overexpressed Gene A in a sample of class “tumor.” Is it
wise to make such a strong assumption? Probably not. It might be better to allow for
a small, non-zero probability. This is what Laplace smoothing does [?]. In this example,
we simply add 1 to each of the three numerators above and then add 3 to each of the
denominators:
P(Gene A = +1 | tumor) = (0 + 1) / (9 + 3),
P(Gene A = 0 | tumor) = (4 + 1) / (9 + 3),
P(Gene A = −1 | tumor) = (5 + 1) / (9 + 3).
However, instead of adding 1, we could also add a small positive constant c weighted
by pi :
P(Gene A = +1 | tumor) = (0 + c·p1) / (9 + c),
P(Gene A = 0 | tumor) = (4 + c·p2) / (9 + c),
P(Gene A = −1 | tumor) = (5 + c·p3) / (9 + c),
where p1 + p2 + p3 = 1 are the prior probabilities for the three states of expression of Gene A. Although such a fully Bayesian specification is possible, in practice it is often unclear how the priors should be estimated, and simple Laplace smoothing is often appropriate [?].
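A minimal sketch of add-one (Laplace) smoothing for the counts discussed above; the parameter alpha = 1 reproduces the fractions (0 + 1)/(9 + 3), (4 + 1)/(9 + 3), and (5 + 1)/(9 + 3).

def smoothed_probs(counts, alpha=1.0):
    """Laplace (add-alpha) smoothing for the empirical conditional probabilities
    of one attribute within one class. counts maps attribute value -> count."""
    total = sum(counts.values()) + alpha * len(counts)
    return {v: (c + alpha) / total for v, c in counts.items()}

# Counts of Gene A within the nine tumor samples, for the modified example above
gene_a_tumor = {"+1": 0, "0": 4, "-1": 5}
print(smoothed_probs(gene_a_tumor))
# {'+1': 1/12 = 0.083..., '0': 5/12 = 0.416..., '-1': 6/12 = 0.5}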
Sample Gene A Gene B Gene C Gene D Class
1 +1 35 +1 0 normal
2 +1 30 +1 +1 normal
3 +1 24 0 0 tumor
4 -1 20 +1 0 tumor
5 -1 15 0 0 tumor
6 -1 15 0 +1 tumor
7 0 11 0 +1 tumor
8 0 12 0 0 normal
9 +1 14 0 0 tumor
10 +1 24 0 0 tumor
11 +1 30 +1 +1 tumor
12 +1 28 +1 +1 tumor
13 -1 23 0 0 tumor
14 -1 21 +1 +1 normal
15 +1 12 +1 +1 unknown
Table 4: Contrived gene expression data set from Table 3. Here, absolute expression
values are reported for Gene B.
In Table 4, Gene B is now a continuous attribute. A common approach for a continuous attribute is to assume a parametric class-conditional distribution, typically a normal (Gaussian) distribution, so we need to choose the density that best describes the data. In this example, we obtain µtumor = 24, σtumor = 7.7, and µnormal = 24, σnormal = 8.5.
Note that the probability that a continuous random variable X takes on a particular
value is always zero for any continuous probability distribution, i.e., P (X = x) = 0.
However, using the probability density function, we can calculate the probability that X lies in a narrow interval [x0 − ϵ/2, x0 + ϵ/2] around x0 as approximately ϵ · f(X = x0). For the new instance x15 (Table 4), we obtain f(12 | tumor) = 0.02267 and f(12 | normal) = 0.01676, so that
we can state the conditional probabilities as:
P(tumor | x15) = (2/9 · 0.0227ϵ · 3/9 · 3/9 · 9/14) / P(x15) = 0.00036ϵ / P(x15)
and
P(normal | x15) = (3/5 · 0.01676ϵ · 4/5 · 3/5 · 5/14) / P(x15) = 0.00172ϵ / P(x15).
The posterior probabilities are:
P(tumor | x15) = 0.00036ϵ / (0.00036ϵ + 0.00172ϵ) ≈ 0.17
and
P(normal | x15) = 0.00172ϵ / (0.00036ϵ + 0.00172ϵ) ≈ 0.83.
Note that P (x15 ) cancels.
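A short sketch of the hybrid computation for x15 = (+1, 12, +1, +1), plugging in the density values reported above; since the interval width ϵ cancels in the posterior, it is omitted.

# Hybrid naive Bayes for x15: the discrete attributes contribute conditional
# probabilities, the continuous attribute (Gene B) contributes its class-conditional
# density value; the interval width epsilon cancels and is therefore left out.
f_tumor, f_normal = 0.02267, 0.01676   # density values for Gene B = 12 reported above

score_tumor = (2/9) * f_tumor * (3/9) * (3/9) * (9/14)
score_normal = (3/5) * f_normal * (4/5) * (3/5) * (5/14)

print(round(score_tumor / (score_tumor + score_normal), 2))    # P(tumor | x15)  -> 0.17
print(round(score_normal / (score_tumor + score_normal), 2))   # P(normal | x15) -> 0.83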
What if an attribute value is missing in the instance to be classified? For example, suppose that the value of Gene B is missing for instance #15, i.e., x15 = (+1, NA, +1, +1). The posterior probability for class yj can then be calculated by simply omitting this attribute, i.e.,
P(tumor | x15) = (2/9 · 3/9 · 3/9 · 9/14) / P(x15) = 0.016 / P(x15)
and
P(normal | x15) = (3/5 · 4/5 · 3/5 · 5/14) / P(x15) = 0.103 / P(x15).
If the training set has missing values, then the conditional probabilities can be calculated by omitting these values. For example, suppose that the value +1 is missing for Gene A in sample #1 (Table 3). What is the probability that Gene A is overexpressed (+1), given that the sample is normal? There are five normal samples, and two of them (#2 and #8, Table 3) have an overexpressed Gene A. Therefore, the conditional probability is calculated as

P(Gene A = +1 | normal) = 2/5.
4. Discussion
In this article, we derived Bayes' theorem from the fundamental concepts of probability. We then presented one member of the family of machine learning methods that are based on this theorem, the naive Bayes classifier, which is one of the oldest workhorses of machine learning.
It is well known that the misclassification error rate is minimized if each instance is classified as a member of that class for which its conditional class posterior probability is maximal. Consequently, the naive Bayes classifier is optimal (cf. Eq. 21) in the sense that no other classifier is expected to achieve a smaller misclassification error rate, provided that the features are indeed conditionally independent given the class. However, this assumption is a rather strong one; clearly, in the vast majority of real-world classification problems, this assumption is violated. This is particularly true for genomic data sets with many co-expressed genes. Perhaps surprisingly, however, the naive Bayes classifier has demonstrated excellent performance even when the data set attributes are not independent [15, 16].
The performance of the naive Bayes classifier can often be improved by eliminating highly correlated features. For example, assume that we add ten additional genes to the data set shown in Table 4, where each gene is described by expression values that are highly correlated with those of Gene B. The estimated conditional probabilities would then be dominated by those values, which would "swamp out" the information contained in the remaining genes.
5. Closing Remarks
Harold Jeffreys, a pioneer of modern statistics, succinctly stated the importance of Bayes' theorem: "[Bayes' theorem] is to the theory of probability what Pythagoras' theorem is to geometry." Indeed, Bayes' theorem is of fundamental importance not only for inferential statistics, but also for machine learning, as it underpins the naive Bayes classifier. This classifier has demonstrated excellent performance compared to more sophisticated models in a range of applications, including tumor classification based on gene expression profiling [19]. The naive Bayes classifier performs remarkably well even when the underlying independence assumption is violated.
The K-NN classifier fundamentally works on the assumption that the data points lie in a feature space in which similar points are close to each other. Hence, the distances between the unknown sample and all data points are computed. The Euclidean distance or the Hamming distance is used, depending on the data type of the attributes. A single value of K is given, which specifies the number of nearest neighbors that determine the class label of the unknown sample. If K = 1, the method is called nearest neighbor classification.
How the K-NN classifier works (a minimal code sketch follows this list):
1. Initialize the value of K.
2. Compute the distance between the unknown sample and every training sample.
3. Sort the training samples by distance and select the K nearest neighbors.
4. Assign the unknown sample to the class that occurs most frequently among these K neighbors.
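A minimal sketch of the procedure, assuming numeric features and Euclidean distance; the training data are illustrative.

from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by a majority vote among its k nearest training samples."""
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Illustrative toy data in a two-dimensional feature space
train_X = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
train_y = ["A", "A", "B", "B", "B", "B"]
print(knn_predict(train_X, train_y, (1.2, 1.5), k=3))   # "A"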
Advantages:
Easy to understand and implement.
Figure 2: An example of the K-NN classifier.
Disadvantages:
Lazy learners incur expensive computational costs when the number of potential neighbors to which a given unlabeled sample must be compared is large.
High memory requirements, because the entire training set must be stored.
Figure 3: An example of decision tree induction.
The example shows a weather forecasting process that deals with predicting whether the weather is sunny, overcast, or rainy, and with the amount of humidity if it is sunny. This tree model can be applied to determine whether the conditions are suitable for playing tennis or not. A person can thus easily assess the present weather and, based on that, decide whether a match is possible.
Advantages:
Decision trees are very simple and fast.
They produce accurate results.
The representation is easy to understand, i.e., comprehensible.
They support incremental learning.
They require little memory.
They can also deal with noisy data.
They use different measures, such as entropy, the Gini index, and information gain, to find the best split attribute (see the sketch after these lists).
Disadvantages:
They have a long training time.
Decision trees can have a significantly more complex representation for some concepts due to the replication problem.
They are prone to overfitting.
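As a brief sketch of how a split attribute can be scored, the code below computes the entropy-based information gain for an illustrative split of 14 samples into three branches.

from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a class label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(parent_labels, child_label_lists):
    """Entropy of the parent node minus the weighted entropy of its children."""
    total = len(parent_labels)
    weighted = sum(len(child) / total * entropy(child) for child in child_label_lists)
    return entropy(parent_labels) - weighted

# Illustrative split: 14 samples (9 "yes", 5 "no") split into three branches
parent = ["yes"] * 9 + ["no"] * 5
children = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(round(information_gain(parent, children), 3))   # ~0.247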
Comparison of the K-NN and naive Bayes classifiers:
Parameter | K-NN | Naive Bayes
Deterministic / non-deterministic | Non-deterministic | Non-deterministic
Effectiveness on | Small data | Huge data
Speed | Slower for large data | Faster than K-NN
Dataset | It cannot deal with noisy data | It can deal with noisy data
Accuracy | Provides high accuracy | For obtaining good results, it requires a very large number of records
7. Results
The following tables show a summary of the results of implementing the different classifiers on the given dataset using the WEKA tool: the classification accuracy of each classifier and the time each classifier takes to classify the dataset.