Bayesian_theory_daniel_restrepo
For example, if you know you have a fair coin, the probability of heads is 0.5; but if you suspect you have a loaded coin, you can estimate that probability by modelling the tosses as a Bernoulli process.
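A minimal sketch of this idea in Python (the bias true_p and the Beta prior below are assumptions invented for illustration): simulate Bernoulli tosses and estimate the head probability, both by maximum likelihood and, in the Bayesian spirit, with a prior.

import random

# A minimal sketch (all numbers are invented): estimating the head
# probability p of a possibly loaded coin from Bernoulli trials.
random.seed(0)
true_p = 0.7                      # the unknown bias we try to recover
tosses = [1 if random.random() < true_p else 0 for _ in range(100)]

# Maximum-likelihood estimate: the fraction of heads.
p_ml = sum(tosses) / len(tosses)

# Bayesian estimate: with a Beta(a, b) prior on p, the posterior after
# h heads and t tails is Beta(a + h, b + t); its mean is computed below.
a, b = 1.0, 1.0                   # uniform prior (an assumption)
h, t = sum(tosses), len(tosses) - sum(tosses)
p_bayes = (a + h) / (a + b + h + t)

print(f"ML estimate: {p_ml:.3f}, posterior mean: {p_bayes:.3f}")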
The running example analyses past transactions: a bank wants to identify good (low-risk) and bad (high-risk) customers from their bank accounts. The customers' yearly income and savings are the observed variables, and they are fundamental to building the classification model.
Bayes’ rule
Bayes' rule combines four concepts: the posterior, the prior (from historical data), the likelihood (a conditional probability, an IF-THEN style rule), and the evidence. For the joint probability to be well defined, the classes must be mutually exclusive and exhaustive.
• Prior: P(C=1) is called the prior probability that C takes the value 1. It depends on the situation and encodes our knowledge before seeing the observation.
• Likelihood: p(x|C) is the conditional probability that an event belonging to class C has the associated observation value x.
• Evidence: p(x) is the marginal probability that an observation x is seen, regardless of whether it is a positive or a negative example.
• Posterior: combining the prior and what the data tell us through Bayes' rule gives the posterior probability, P(C|x) = p(x|C) P(C) / p(x).
The Bayes' classifier chooses the class with the highest posterior probability: after calculating the posterior probability of each class, pick the one for which it is largest, as in the sketch below.
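A minimal sketch of Bayes' rule and this decision rule on the bank example; the prior and the likelihood values are invented numbers standing in for what would be estimated from the historical data.

# Minimal sketch: Bayes' classifier for the bank example, with classes
# C=0 (low-risk) and C=1 (high-risk). All numbers are invented.
prior = {0: 0.7, 1: 0.3}           # P(C), e.g. from past customers

# Likelihood P(x|C) of a discretised observation x, here whether the
# customer's savings are "high" or "low" (an assumed encoding).
likelihood = {
    ("high", 0): 0.8, ("high", 1): 0.3,
    ("low", 0): 0.2,  ("low", 1): 0.7,
}

def posterior(x):
    # Evidence P(x): the marginal over both classes.
    evidence = sum(likelihood[(x, c)] * prior[c] for c in prior)
    # Bayes' rule: P(C|x) = P(x|C) P(C) / P(x).
    return {c: likelihood[(x, c)] * prior[c] / evidence for c in prior}

post = posterior("low")
chosen = max(post, key=post.get)   # class with the highest posterior
print(post, "-> choose class", chosen)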
Define the action αi as the decision to assign the input to class Ci, and λik as the loss incurred for taking action αi when the input actually belongs to Ck. The expected risk of an action is R(αi|x) = Σk λik P(Ck|x), and the chosen action is the one with minimum risk. The loss λik thus depends on two things: the action taken (i) and the true class (k). For the bank example, with α0 = give the loan and α1 = deny it:
λik           C0 (low risk)    C1 (high risk)
α0 (give)           0                 1
α1 (deny)           1                 0
With this 0/1 loss the risk of choosing class Ci is R(αi|x) = 1 − P(Ci|x), so the minimum-risk action satisfies

1 − P(Ci|x) = min over k of (1 − P(Ck|x)), equivalently P(Ci|x) = max over k of P(Ck|x).
Adding a reject option as an extra action αK+1, the loss becomes

λik = 0 if i = k (correct decision), λ if i = K+1 (reject), 1 otherwise,

where 0 < λ < 1 is the cost of rejecting.
Then compare the risks of all actions, including the reject action, and take the one with minimum risk. Since the risk of choosing Ci is 1 − P(Ci|x) and the risk of rejecting is λ, the rule becomes:

choose Ci if P(Ci|x) > P(Ck|x) for all k ≠ i and P(Ci|x) > 1 − λ; reject otherwise.
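A minimal sketch of this decision rule with a reject option; the reject cost lam and the posterior values are invented.

# Minimal sketch: minimum-risk classification with a reject option.
# The reject cost lam and the posteriors are invented.
lam = 0.25                          # cost of rejecting, 0 < lam < 1

def decide(post):                   # post maps class k -> P(Ck|x)
    c_best = max(post, key=post.get)
    # Risk of choosing c_best is 1 - P(c_best|x); rejecting costs lam.
    if 1.0 - post[c_best] < lam:
        return c_best
    return "reject"

print(decide({0: 0.9, 1: 0.1}))     # confident enough -> class 0
print(decide({0: 0.55, 1: 0.45}))   # risk 0.45 >= lam -> reject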
In some cases wrong decisions (misclassifications) may have a very high cost; it is then better to reject the input and pass it to a more complex (for example, manual) decision system.
Discriminant functions
The aim is to establish a model able to discriminate between classes. Classification can also be seen as implementing a set of discriminant functions gi(x), i = 1, …, K, where we choose Ci if gi(x) = max over k of gk(x); these functions partition the feature space into decision regions separated by decision boundaries. (Figure 1)
Figure 1: example of decision regions and decision boundaries. (Alpaydin, 2004)
For two classes, instead of two separate discriminants we can define a single one, g(x) = g1(x) − g2(x), and

choose C1 if g(x) > 0, C2 otherwise.
This is the two-class learning problem, where the positive examples can be taken as C1 and the negative examples as C2.
Classification system
• Dichotomizer: K = 2 classes.
• Polychotomizer: K > 2 classes.
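A minimal sketch of both kinds of classifier; the linear weights below are assumptions (in practice they would be learned from data).

# Minimal sketch of a dichotomizer and a polychotomizer built from
# linear discriminants; all weights are invented.

def g(w, w0, x):                    # linear discriminant g(x) = w.x + w0
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

# Dichotomizer (K = 2): a single g(x) = g1(x) - g2(x); its sign decides.
def dichotomize(x):
    gx = g([1.0, -2.0], 0.5, x)     # assumed weights for g1 - g2
    return "C1" if gx > 0 else "C2"

# Polychotomizer (K > 2): choose the class whose gi(x) is largest.
gs = {"C1": ([1.0, 0.0], 0.0),
      "C2": ([0.0, 1.0], 0.1),
      "C3": ([-1.0, -1.0], 0.2)}

def polychotomize(x):
    return max(gs, key=lambda c: g(gs[c][0], gs[c][1], x))

print(dichotomize([2.0, 0.5]), polychotomize([2.0, 0.5]))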
Utility Theory
Utility theory generalizes the expected-risk approach: we define the utility Uik of taking action αi when the state is Sk, compute the expected utility EU(αi|x) = Σk Uik P(Sk|x), and choose the action that maximizes it (with Uik = −λik this is the same as minimizing expected risk). Utility theory is concerned with making rational decisions when we are uncertain about the state.
Note that maximizing expected utility is just one possibility; one may define other
types of rational behaviour, for example, minimizing worst possible loss.
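A minimal sketch of expected-utility maximization on the loan decision; the utility table and the posterior are invented numbers.

# Minimal sketch: choose the action that maximizes expected utility
# EU(a|x) = sum_k U[a][k] * P(Sk|x). All numbers are invented.
U = {"give": {0: 10.0, 1: -50.0},   # utility of granting the loan
     "deny": {0: -5.0, 1: 0.0}}     # utility of denying it

post = {0: 0.8, 1: 0.2}             # P(state|x), an assumed posterior

eu = {a: sum(U[a][k] * post[k] for k in post) for a in U}
print(eu, "-> take action", max(eu, key=eu.get))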
Value of information
It is also relevant to evaluate the quality of information: a new piece of information (for example, an extra observed variable) is good only to the extent that knowing it can change the chosen action, and its value can be measured as the resulting increase in expected utility.
Bayesian networks
This method, also called a belief network or probabilistic network, is one of the most used methods at the moment:
• Graphical model.
• Representing interaction between variables visually.
• Composed of nodes and arcs between the nodes.
• Each node corresponds to a random variable, X, and carries the probability of that variable's values.
• A directed arc from X to Y means that X has a direct influence on Y.
• The graph is a directed acyclic graph (DAG): there are no cycles.
• The nodes and the arcs define the structure of the network.
For the model given, the independencies between C, R, and S are encoded explicitly in the graph. This is part of the advantage of Bayesian networks: they make independencies explicit and allow breaking inference down into calculations over small groups of variables, as in the sketch below.
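Since the figure with the model is not reproduced in these notes, here is a minimal sketch that assumes the classic structure cloudy (C) pointing to sprinkler (S) and rain (R), which both point to wet grass (W); every probability below is invented. It shows how the DAG factors the joint into small pieces and supports inference by enumeration.

# Minimal sketch of a Bayesian network (assumed structure, invented
# probabilities): C -> S, C -> R, and (S, R) -> W.
P_C = 0.5                                    # P(C = true)
P_S = {True: 0.1, False: 0.5}                # P(S = true | C)
P_R = {True: 0.8, False: 0.2}                # P(R = true | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}   # P(W = true | S, R)

def joint(c, s, r, w):
    # Chain rule along the DAG: P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)
    p = P_C if c else 1 - P_C
    p *= P_S[c] if s else 1 - P_S[c]
    p *= P_R[c] if r else 1 - P_R[c]
    p *= P_W[(s, r)] if w else 1 - P_W[(s, r)]
    return p

# Inference by enumeration: P(R = true | W = true).
bools = (True, False)
num = sum(joint(c, s, True, True) for c in bools for s in bools)
den = sum(joint(c, s, r, True) for c in bools for s in bools for r in bools)
print(f"P(R | W) = {num / den:.3f}")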
Junction Tree
An algorithm that converts a given directed acyclic graph into a tree by clustering variables, so that belief propagation can be performed efficiently.
One of the best advantages of using Bayesian networks is that we do not need to designate explicitly certain variables as input and certain others as output. The values of any set of variables can be set through evidence, the probabilities of any other set of variables can be inferred, and the difference between unsupervised and supervised learning becomes blurred.
Bayesian networks, classification
[Figure: a simple two-node network over variables A and B]
Influence diagrams
Association Rules
An association rule is an implication of the form X → Y.
There are two measures: support shows the statistical significance of the rule, whereas confidence shows its strength. Support(X → Y) is P(X, Y), the fraction of transactions containing both X and Y; confidence(X → Y) is P(Y|X) = P(X, Y) / P(X).
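A minimal sketch computing both measures over a toy list of market-basket transactions (the items are invented).

# Minimal sketch: support and confidence of a rule X -> Y over a toy
# list of transactions, each represented as a set of items.
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread"}, {"milk", "butter"}, {"milk", "bread"},
]

def support(itemset):
    # P(itemset): fraction of transactions containing all of its items.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # P(Y|X) = P(X, Y) / P(X): the strength of the rule X -> Y.
    return support(x | y) / support(x)

print("support(milk, bread):", support({"milk", "bread"}))
print("confidence(milk -> bread):", confidence({"milk"}, {"bread"}))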
References
Alpaydin, E. Introduction to Machine Learning. The MIT Press, October 2004. ISBN 0-262-01211-1.
Seeger, M. Bayesian Modelling for Data Analysis and Learning from Data. Max Planck Institute for Biological Cybernetics, March 18, 2006. Notes providing clarifying remarks and definitions complementing the course Bayesian Modelling for Data Analysis and Learning from Data, held at IK 2006. http://www.kyb.tuebingen.mpg.de/bs/people/seeger/papers/handout.pdf