IT8601 Unit IV
Probability basics - Bayes Rule and its Applications - Bayesian Networks – Exact and
Approximate Inference in Bayesian Networks - Hidden Markov Models - Forms of Learning -
Supervised Learning - Learning Decision Trees – Regression and Classification with Linear
Models - Artificial Neural Networks – Nonparametric Models - Support Vector Machines -
Statistical Learning - Learning with Complete Data - Learning with Hidden Variables- The EM
Algorithm – Reinforcement Learning
1. Probability basics
• Random Variables
• Joint and Marginal Distributions
• Conditional Distribution
• Product Rule, Chain Rule
Random Variables
The random variable is the basic element of probability theory.
• Similar to propositional logic: possible worlds are defined by an assignment of values to random
variables.
• Boolean random variables
e.g., Cavity (do I have a cavity?)
• Discrete random variables
e.g., Weather is one of <sunny,rainy,cloudy,snow>
• Domain values must be exhaustive and mutually exclusive
• Elementary proposition constructed by assignment of a value to a
random variable: e.g., Weather = sunny, Cavity = false (abbreviated as cavity)
A random variable is some aspect of the world about which we (may) have uncertainty
R = Is it raining?
T = Is it hot or cold?
D = How long will it take to drive to work?
L = Where is the ghost?
Axioms of probability
For any propositions A, B
– 0 ≤ P(A) ≤ 1
– P(true) = 1 and P(false) = 0
– P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Probability Distributions
Associate a probability with each value
Temperature T : P(T)
Weather W: P(W)
Shorthand notation:
P(hot)=P(T=hot)
P(cold)=P(T=cold)
P(rain)=P(W=rain)
A probability (lower case value) is a single number
P(rain)=P(W=rain)=0.1
Prior probability: the degree of belief in a proposition in the absence of any other evidence, e.g., P(W=rain)=0.1
Joint Distributions
A joint distribution over a set of random variables: X1, X2, X3,…Xn specifies a real number for
each assignment (or outcome):
P(X1= x1, X2= x2, X3= x3,…Xn =xn), P(x1,x2,x3,…xn)
It must obey:
P(x1,x2,x3,…xn) >= 0 for every assignment, and
the probabilities of all assignments must sum to 1.
(Example: a joint distribution table P(T, W) over Temperature and Weather.)
Events
Atomic event: A complete specification of the state of the world about which the agent is uncertain
E.g., if the world consists of only two Boolean variables Cavity and Toothache, then there are 4
distinct atomic events:
Cavity = false ∧ Toothache = false
Cavity = false ∧ Toothache = true
Cavity = true ∧ Toothache = false
Cavity = true ∧ Toothache = true
An event is a set E of outcomes
Marginal Distributions
▪ Marginal distributions are sub-tables which eliminate variables
▪ Marginalization (summing out): Combine collapsed rows by adding
(Example: marginalizing the joint table P(T, W) over W gives P(T); marginalizing over T gives P(W).)
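To make marginalization concrete, here is a minimal Python sketch; the joint table values below are illustrative assumptions, not taken from the notes.

```python
# A minimal sketch of a joint distribution P(T, W) and marginalization.
# The probability values below are illustrative, not from the original table.
joint = {
    ('hot', 'sun'): 0.4, ('hot', 'rain'): 0.1,
    ('cold', 'sun'): 0.2, ('cold', 'rain'): 0.3,
}

def marginal(joint, axis):
    """Sum out the other variable: axis=0 gives P(T), axis=1 gives P(W)."""
    result = {}
    for outcome, p in joint.items():
        result[outcome[axis]] = result.get(outcome[axis], 0.0) + p
    return result

print(marginal(joint, 0))  # P(T): {'hot': 0.5, 'cold': 0.5}
print(marginal(joint, 1))  # P(W): {'sun': 0.6, 'rain': 0.4}
```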
Conditional Probabilities
Conditional or posterior probabilities
e.g., P(cavity | toothache) = 0.8
i.e., given that toothache is all I know
P(Cavity | Toothache) = a 2-element vector of 2-element vectors
If we know more, e.g., cavity is also given, then we have
P(cavity | toothache, cavity) = 1
P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8
This kind of inference, sanctioned by domain knowledge, is crucial
▪ A simple relation between joint and conditional probabilities: the product rule
P(a, b) = P(a | b) P(b), equivalently P(a | b) = P(a, b) / P(b)
▪ Applied repeatedly, this gives the chain rule:
P(x1, x2, …, xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) … = Πi P(xi | x1, …, xi−1)
Conditional Distributions
▪ Conditional distributions are probability distributions over some variables given fixed values
of others
Probabilistic Inference
▪ compute a desired probability from other known probabilities (e.g. conditional from joint)
Compute conditional probabilities
▪ P(on time | no reported accidents) = 0.90
▪ These represent the agent’s beliefs given the evidence
Probabilities change with new evidence:
▪ P(on time | no accidents, 5 a.m.) = 0.95
▪ P(on time | no accidents, 5 a.m., raining) = 0.80
▪ Observing new evidence causes beliefs to be updated
Conditional Independence
▪ Conditional independence is our most basic and robust form of knowledge about uncertain
environments.
▪ X is conditionally independent of Y given Z
if and only if: P(X | Y, Z) = P(X | Z)
or, equivalently, if and only if: P(X, Y | Z) = P(X | Z) P(Y | Z)
Example 1:
▪ P(Toothache, Cavity, Catch)
▪ If I have a cavity, the probability that the probe catches in it doesn't depend on whether I
have a toothache:
▪ P(+catch | +toothache, +cavity) = P(+catch | +cavity)
▪ The same independence holds if I don’t have a cavity:
▪ P(+catch | +toothache, -cavity) = P(+catch| -cavity)
▪ Catch is conditionally independent of Toothache given Cavity:
▪ P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Equivalent statements:
▪ P(Toothache | Catch , Cavity) = P(Toothache | Cavity)
▪ P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
▪ One can be derived from the other easily
Example 2:
▪ P(Traffic, Umbrella, Raining)
Example 3:
▪ P(Fire, Smoke, Alarm)
2. Bayes Rule and its Applications
Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)
Example:
M: meningitis, S: stiff neck
P(m | s) = P(s | m) P(m) / P(s)
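A hedged sketch of applying Bayes' rule to this example; the numbers below (P(s|m)=0.7, P(m)=1/50000, P(s)=0.01) are the commonly quoted illustrative values, not given in the notes above.

```python
# Bayes' rule for the meningitis / stiff-neck example (illustrative numbers).
p_s_given_m = 0.7        # P(stiff neck | meningitis), assumed
p_m = 1 / 50000          # prior P(meningitis), assumed
p_s = 0.01               # P(stiff neck), assumed

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0014: a stiff neck rarely indicates meningitis
```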
3. Bayesian networks
A simple, graphical notation for conditional independence assertions and hence for compact
specification of full joint distributions
Syntax
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ "directly influences")
– a conditional distribution for each node given its parents:
P (Xi | Parents (Xi))
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT)
giving the distribution over Xi for each combination of parent values
• Topology of network encodes conditional independence assertions:
Compactness
• A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent
values
• Each row requires one number p for Xi = true (the number for Xi = false is just 1-p)
• If each variable has no more than k parents, the complete network requires O(n · 2^k)
numbers i.e., grows linearly with n, vs. O(2^n) for the full joint distribution
• For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 - 1 = 31)
P(J | M) = P(J)?
No
P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A ,J, M) = P(E | A)? No
P(E | B, A, J, M) = P(E | A, B)? Yes
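To illustrate how the network's CPTs define the full joint distribution, here is a sketch that evaluates one joint entry, P(j ∧ m ∧ a ∧ ¬b ∧ ¬e), as a product of CPT entries. The CPT numbers are the standard textbook values for the burglary network and should be treated as assumptions here.

```python
# P(j, m, a, not b, not e) = P(j|a) P(m|a) P(a|not b, not e) P(not b) P(not e)
# CPT values below are the usual textbook numbers (assumptions).
P_b, P_e = 0.001, 0.002                             # priors for Burglary, Earthquake
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(a | B, E)
P_j_given_a = {True: 0.90, False: 0.05}             # P(j | A)
P_m_given_a = {True: 0.70, False: 0.01}             # P(m | A)

p = (P_j_given_a[True] * P_m_given_a[True]
     * P_a[(False, False)] * (1 - P_b) * (1 - P_e))
print(p)  # ~0.000628
```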
4. Markov Models and Hidden Markov Models
▪ Parameters: called transition probabilities or dynamics, they specify how the state evolves over
time (along with initial state probabilities)
▪ Stationarity assumption: transition probabilities are the same at all times
▪ Same as an MDP transition model, but with no choice of action
▪ From the chain rule, every joint distribution over X1, …, XT can be written as:
P(X1, X2, …, XT) = P(X1) Π(t=2..T) P(Xt | X1, …, Xt−1)
▪ Markov Models make the explicit assumption, for all t, that P(Xt | X1, …, Xt−1) = P(Xt | Xt−1)
▪ The joint distribution can then be written as:
P(X1, X2, …, XT) = P(X1) Π(t=2..T) P(Xt | Xt−1)
Conditional Independence
▪ Basic conditional independence:
▪ Past and future are independent given the present
▪ Each time step only depends on the previous
▪ This is called the (first order) Markov property
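A minimal sketch of a first-order Markov chain over two weather states; the transition probabilities are illustrative assumptions.

```python
import random

# Transition model P(X_{t+1} | X_t); values are illustrative assumptions.
T = {'sun': {'sun': 0.9, 'rain': 0.1},
     'rain': {'sun': 0.3, 'rain': 0.7}}

def sample_next(state):
    """Sample X_{t+1} from P(X_{t+1} | X_t = state)."""
    r, cum = random.random(), 0.0
    for nxt, p in T[state].items():
        cum += p
        if r < cum:
            return nxt
    return nxt  # guard against floating-point round-off

state = 'sun'
chain = [state]
for _ in range(10):
    state = sample_next(state)   # each step depends only on the previous state
    chain.append(state)
print(chain)
```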
5. Forms of Learning
An agent is learning if it improves its performance on future tasks after making
observations about the world.
Any component of an agent can be improved by learning from data. The improvements, and
the techniques used to make them, depend on four major factors:
• Which component is to be improved.
• What prior knowledge the agent already has.
• What representation is used for the data and the component.
• What feedback is available to learn from.
Components to be learned
1. A direct mapping from conditions on the current state to actions.
2. A means to infer relevant properties of the world from the percept sequence.
3. Information about the way the world evolves and about the results of possible actions the agent
can take.
4. Utility information indicating the desirability of world states.
5. Action-value information indicating the desirability of actions.
6. Goals that describe classes of states whose achievement maximizes the agent’s utility.
6. Supervised Learning
The task of supervised learning is this:
Given a training set of N example input–output pairs
(x1, y1), (x2, y2), . . . (xN, yN),
where each yj was generated by an unknown function y = f(x),
discover a function h that approximates the true function f.
Here x and y can be any value; they need not be numbers. The function h is a hypothesis.
Learning is a search through the space of possible hypotheses for one that will perform well,
even on new examples beyond the training set. To measure the accuracy of a hypothesis we
give it a test set of examples that are distinct from the training set. We say a hypothesis
generalizes well if it correctly predicts the value of y for novel examples.
The examples are points in the (x, y) plane, where y = f(x). We don't know what f is, but we will
approximate it with a function h selected from a hypothesis space, H, which for this example we will
take to be the set of polynomials, such as x^5 + 3x^2 + 2. Consider data with an exact
fit by a straight line (the polynomial 0.4x + 3). The line is called a consistent hypothesis because it
agrees with all the data.
A learning problem is realizable if the hypothesis space contains the true function.
Classification
When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the learning
problem is called classification, and is called Boolean or binary classification if there are only two values.
Regression
When y is a number (such as tomorrow’s temperature), the learning problem is called
regression.
Supervised learning can be done by choosing the hypothesis h∗ that is most probable given the
data:
h∗ = argmaxh∈H P(h|data) .
By Bayes’ rule this is equivalent to
h∗ = argmaxh∈H P(data|h) P(h) .
where the prior probability P(h) can express a preference for some hypotheses (for example, simpler ones) over others.
7. Learning Decision Trees
Algorithm
The decision-tree learning algorithm greedily chooses the most important attribute (the one with the
highest information gain) as the root, splits the examples on its values, and recurses on each subset.
Information Gain
Entropy(S) = - Σi pi log2 pi, where pi is the proportion of examples in S belonging to class i
Gain(S, A) = Entropy(S) - Σ(v ∈ Values(A)) (|Sv| / |S|) Entropy(Sv)
Example: the play-tennis data set with 14 examples (9 Yes, 5 No).
Outlook Attribute
Values(Outlook) = {Rainy, Overcast, Sunny}
Outlook (9 Yes, 5 No): Entropy(S) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.94
Rainy (2 Yes, 3 No): Entropy(S_Rainy) = - 2/5 log2(2/5) - 3/5 log2(3/5) = 0.971
Overcast (4 Yes, 0 No): Entropy(S_Overcast) = - 4/4 log2(4/4) - 0/4 log2(0/4) = 0
Sunny (3 Yes, 2 No): Entropy(S_Sunny) = - 3/5 log2(3/5) - 2/5 log2(2/5) = 0.971
Gain(S, Outlook) = 0.94 - (5/14)(0.971) - (4/14)(0) - (5/14)(0.971) = 0.2464
Temperature Attribute
Values(Temp) = {Hot, Mild, Cool}
Temp (9 Yes, 5 No): Entropy(S) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.94
Hot (2 Yes, 2 No): Entropy(S_Hot) = - 2/4 log2(2/4) - 2/4 log2(2/4) = 1.0
Mild (4 Yes, 2 No): Entropy(S_Mild) = - 4/6 log2(4/6) - 2/6 log2(2/6) = 0.918
Cool (3 Yes, 1 No): Entropy(S_Cool) = - 3/4 log2(3/4) - 1/4 log2(1/4) = 0.8113
Gain(S, Temp) = Entropy(S) - Σ (|Si|/|S|) Entropy(Si)
= Entropy(S) - (4/14) Entropy(S_Hot) - (6/14) Entropy(S_Mild) - (4/14) Entropy(S_Cool)
= 0.94 - (4/14) * 1.0 - (6/14) * 0.918 - (4/14) * 0.8113
= 0.0289
Humidity Attribute
Values(Humidity) = {High, Normal}
Humidity (9 Yes, 5 No): Entropy(S) = - 9/14 log2(9/14) - 5/14 log2(5/14)
= -(0.6428 * -0.6374) - (0.357 * -1.4854)
= 0.4097 + 0.5303 = 0.94
High (3 Yes, 4 No): Entropy(S_High) = - 3/7 log2(3/7) - 4/7 log2(4/7) = 0.985
Normal (6 Yes, 1 No): Entropy(S_Normal) = - 6/7 log2(6/7) - 1/7 log2(1/7) = 0.592
Gain(S, Humidity) = 0.94 - (7/14)(0.985) - (7/14)(0.592) = 0.1515
Wind Attribute
Values(Wind) = {True, False}
Wind (9 Yes, 5 No): Entropy(S) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.94
True (3 Yes, 3 No): Entropy(S_True) = - 3/6 log2(3/6) - 3/6 log2(3/6)
= -(0.5 * -1) - (0.5 * -1)
= 0.5 + 0.5 = 1.0
False (6 Yes, 2 No): Entropy(S_False) = - 6/8 log2(6/8) - 2/8 log2(2/8)
= -(0.75 * -0.415) - (0.25 * -2)
= 0.31125 + 0.5 = 0.81125
Gain(S, Wind) = Entropy(S) - Σ (|Si|/|S|) Entropy(Si)
= Entropy(S) - (6/14) Entropy(S_True) - (8/14) Entropy(S_False)
= 0.94 - (6/14) * 1.0 - (8/14) * 0.81125
= 0.94 - 0.4286 - 0.4636
= 0.0478
Outlook has the largest information gain (0.2464), so it is chosen as the root attribute of the tree.
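The calculations above can be checked with a short sketch; the counts are the play-tennis subsets used above (Humidity uses the corrected 3/4 and 6/1 splits).

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a set with pos positive and neg negative examples."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:                      # define 0 * log2(0) = 0
            p = count / total
            e -= p * log2(p)
    return e

def gain(parent, subsets):
    """Information gain: parent and subsets are (pos, neg) count pairs."""
    n = sum(p + q for p, q in subsets)
    remainder = sum((p + q) / n * entropy(p, q) for p, q in subsets)
    return entropy(*parent) - remainder

print(round(gain((9, 5), [(2, 3), (4, 0), (3, 2)]), 4))  # Outlook  ~0.2467
print(round(gain((9, 5), [(2, 2), (4, 2), (3, 1)]), 4))  # Temp     ~0.0292
print(round(gain((9, 5), [(3, 4), (6, 1)]), 4))          # Humidity ~0.1518
print(round(gain((9, 5), [(3, 3), (6, 2)]), 4))          # Wind     ~0.0481
```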
With ten Boolean attributes, as in our restaurant problem, there are 2^1024, or about 10^308,
different functions to choose from, so we need a greedy heuristic to search this hypothesis space.
An attribute splits the examples E into subsets Ei, each of which needs less information to
complete the classification. Let Ei have pi positive and ni negative examples; the expected
information remaining after testing the attribute is the weighted sum of the entropies of these subsets.
For Patrons?, this remaining information is 0.459 bits; for Type it is (still) 1 bit. We choose the
attribute that minimizes the remaining information needed.
8. Regression and Classification with Linear Models
Linear Regression
Linear regression makes predictions for continuous/real or numeric variables such as sales,
salary, age, product price, etc.
y= a0+a1x+ ε
Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.
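A minimal least-squares sketch for fitting y = a0 + a1·x; the data points are made-up illustrative values generated near the line y = 0.4x + 3 mentioned earlier.

```python
# Closed-form least-squares estimates: a1 = cov(x, y) / var(x),
# a0 = mean(y) - a1 * mean(x). Data below are illustrative assumptions.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.4, 3.8, 4.2, 4.6, 5.0]   # points on y = 0.4x + 3

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
     / sum((x - mean_x) ** 2 for x in xs)
a0 = mean_y - a1 * mean_x
print(a0, a1)  # ~3.0, ~0.4
```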
Classification Algorithm
The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data.
Types of Classifications
Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
9. Artificial Neural Networks
Biological Neuron
Biologically, we can also define a neuron. The human body is made up of a vast array of
living cells. Certain cells are interconnected in a way that allows them to communicate pain or to
actuate fibres or tissues. Some cells control the opening and closing of minuscule valves in the
veins and arteries. These specialized communication cells are called neurons. Neurons are
equipped with long tentacle like structures that stretch out from the cell body, permitting them to
communicate with other neurons. The tentacles that take in signals from other cells and the
environment itself are called dendrites, while the tentacles that carry signals from the neuron to
other cells are called axons.
Artificial Neuron
A neural network is a graph, with patterns represented in terms of numerical values
attached to the nodes of the graph and transformations between patterns achieved via simple
message-passing algorithms.
The graph contains a number of units and weighted unidirectional connections between
them. The output of one unit typically becomes an input for another. There may also be units with
external inputs and outputs. The nodes in the graph are generally distinguished as being input
nodes or output nodes and the graph as a whole can be viewed as a representation of a multivariate
function linking inputs to outputs.
Numerical values (weights) are attached to the links of the graphs, parameterizing the
input/ output function and allowing it to be adjusted via a learning algorithm. A broader view of
neural network architecture involves treating the network as a statistical processor characterized
by making particular probabilistic assumptions about data. Figure illustrates one example of a
possible neural network structure.
10. Support Vector Machines
Hyperplane
A hyperplane in an n-dimensional Euclidean space is an (n-1)-dimensional subset of that space that
divides the space into two disconnected parts; for example, a line in 2 dimensions or a plane in 3 dimensions.
A Support Vector Machine (SVM) performs binary linear classification by finding the optimal
(maximum-margin) hyperplane that separates the two classes of instances.
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
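A hedged sketch using scikit-learn's SVC (this assumes scikit-learn is installed); the four training points are illustrative.

```python
# Linear SVM sketch with scikit-learn; data points are illustrative.
from sklearn import svm

X = [[0, 0], [1, 1], [4, 4], [5, 5]]   # two linearly separable groups
y = [0, 0, 1, 1]

clf = svm.SVC(kernel='linear')         # linear kernel: a separating hyperplane
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))  # [0 1]
print(clf.support_vectors_)            # the points that define the margin
```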
11. Nonparametric Models
A nonparametric model is one that cannot be characterized by a bounded set of parameters.
Example: suppose that each hypothesis retains within itself all of the training examples and uses
all of them to predict the next example.
This is called instance-based learning or memory-based learning.
The effective number of parameters is unbounded: it grows with the number of
examples.
Table lookup is the simplest instance-based learning method : take all the training examples, put
them in a lookup table, and then when asked for h(x), see if x is in the table; if it is, return the
corresponding y.
"Nearest" implies a distance metric. Distances are measured with a Minkowski distance, or Lp norm:
Lp(xj, xq) = (Σi |xj,i - xq,i|^p)^(1/p)
When p = 2 this is Euclidean distance and when p = 1 it is Manhattan distance. With Boolean attribute
values, the number of attributes on which the two points differ is called the Hamming distance.
To keep all dimensions comparable, we normalize the measurements in each dimension:
rescale xj,i to (xj,i - μi)/σi, where μi and σi are the mean and standard deviation of dimension i.
A more complex metric known as the Mahalanobis distance takes into account the covariance between
dimensions.
Nearest neighbors works very well with low-dimensional spaces with plenty of data.
Let k=10 and N=1,000,000. In two dimensions (n=2; a unit square), the average neighborhood has
side length ℓ ≈ 0.003, a small fraction of the unit square, and in 3 dimensions ℓ is just 2% of the edge length of the
unit cube. But by the time we get to 17 dimensions, ℓ is half the edge length of the unit hypercube, and in
200 dimensions it is 94%. This problem has been called the curse of dimensionality.
Nearest neighbors also gives a poor fit near outliers, and a naive implementation has O(N) execution time per query.
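A small k-nearest-neighbors classifier sketch using Euclidean (L2) distance; the training points and labels are made up for illustration.

```python
import math
from collections import Counter

# Tiny labeled training set (illustrative assumption).
train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'),
         ((4.0, 4.2), 'B'), ((3.8, 4.0), 'B'), ((4.1, 3.9), 'B')]

def knn_predict(xq, k=3):
    """Return the majority label among the k training points nearest xq."""
    nearest = sorted(train,
                     key=lambda ex: math.dist(xq, ex[0]))  # O(N log N) per query
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 0.9)))  # 'A'
print(knn_predict((4.0, 4.0)))  # 'B'
```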
Locality-sensitive hashing
Hash tables have the potential to provide even faster lookup than binary trees. But how
can we find nearest neighbors using a hash table, when hash codes rely on an exact match? Hash
codes randomly distribute values among the bins, but we want to have near points grouped
together in the same bin; we want a locality-sensitive hash (LSH).
LSH solves the approximate near-neighbors problem: given a data set of example points and a query
point xq, find, with high probability, an example point (or points) that is near xq.
Nonparametric regression
In (a), we have perhaps the simplest method of all, known informally as “connect-the-dots,” and
superciliously as “piecewise linear nonparametric regression.” This model creates a function h(x) that,
when given a query xq, solves the ordinary linear regression problem with just two points: the training
examples immediately to the left and right of xq. When noise is low, this trivial method is
actually not too bad, which is why it is a standard feature of charting software in spreadsheets. But when
the data are noisy, the resulting function is spiky, and does not generalize well. k-nearest-neighbors
regression (Figure 18.28(b)) improves on connect-the-dots. Instead of using just the two examples to the
left and right of a query point xq, we use the k nearest neighbors (here 3). A larger value of k tends to
smooth out the magnitude of the spikes, although the resulting function has discontinuities. In (b), we have
the k-nearest-neighbors average: h(x) is the mean value of the k points, Σj yj / k. Notice that at the outlying
points, near x=0 and x=14, the estimates are poor because all the evidence comes from one
side (the interior), and ignores the trend. In (c), we have k-nearest-neighbor linear regression, which finds
the best line through the k examples. This does a better job of capturing trends at the outliers, but is still
discontinuous. In both (b) and (c), we’re left with the question of how to choose a good value for k. The
answer, as usual, is cross-validation.
Locally weighted regression gives us the advantages of nearest neighbors, without the discontinuities. To
avoid discontinuities in h(x), we need to avoid discontinuities in the set of examples we use to estimate
h(x). The idea of locally weighted regression is that at each query point xq, the examples that are close to
xq are weighted heavily, and the examples that are farther away are weighted less heavily or not at all. The
decrease in weight over distance is always gradual, not sudden. A kernel function looks like a bump; in
Figure 18.29 we see the specific kernel used to generate
Figure 18.28(d). We can see that the weight provided by this kernel is highest in the center and reaches zero
at a distance of ±5. Can we choose just any function for a kernel? No. First, note that we invoke a kernel
function K with K(Distance(xj, xq)), where xq is a query point that is a given distance from xj, and we
want to know how much to weight that distance. So K should be symmetric around 0 and have a maximum
at 0. The area under the kernel must remain bounded as we go to ±∞. Other shapes, such as Gaussians,
have been used for kernels, but the latest research suggests that the choice of shape doesn't matter much.
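A locally weighted regression sketch. For brevity this fits a weighted mean at each query point (a zeroth-order local fit, simpler than the local linear fit described above) using a quadratic kernel that reaches zero at distance w/2 = 5; the data and kernel width are illustrative assumptions.

```python
# Data: illustrative noisy points along an upward trend (assumption).
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1.0, 1.5, 1.2, 2.0, 2.6, 2.4, 3.1, 3.5, 3.3, 4.0, 4.2]

def kernel(d, w=10.0):
    """Quadratic kernel: symmetric, maximum at d = 0, zero beyond |d| = w/2."""
    u = 2 * abs(d) / w
    return max(0.0, 1.0 - u * u)

def lwr_predict(xq):
    """Weighted mean of y: nearby examples count heavily, far ones not at all."""
    weights = [kernel(x - xq) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total

print(lwr_predict(5.0))
```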
12. Statistical Learning
Agents can handle uncertainty by using the methods of probability and decision theory.
Data
Data are evidence: that is, instantiations of some or all of the random variables describing
the domain.
Hypotheses
Hypotheses are probabilistic theories of how the domain works.
The random variable H (for hypothesis) denotes the type of the bag, with possible values
h1 through h5 (for example: h1 = 100% cherry, h2 = 75% cherry + 25% lime, h3 = 50% cherry +
50% lime, h4 = 25% cherry + 75% lime, h5 = 100% lime).
Di is a random variable with possible values cherry and lime. As pieces of candy are opened
and inspected, data are revealed: D1, D2, . . ., DN.
The task is to predict the flavor of the next piece of candy.
12.1 Bayesian learning
Bayesian learning simply calculates the probability of each hypothesis, given the data, and makes
predictions on that basis.
Let D represent all the data, with observed value d; then the probability of each hypothesis
is obtained by Bayes’ rule:
P(hi | d) = αP(d | hi)P(hi) .
To make a prediction about an unknown quantity X, we average over all hypotheses:
P(X | d) = Σi P(X | d, hi) P(hi | d) = Σi P(X | hi) P(hi | d)
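A sketch of Bayesian learning for the candy-bag example. The priors (0.1, 0.2, 0.4, 0.2, 0.1) and cherry proportions (1.0, 0.75, 0.5, 0.25, 0.0) are the standard textbook values for h1..h5; treat them as assumptions here.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1) .. P(h5), assumed
p_cherry = [1.0, 0.75, 0.5, 0.25, 0.0]    # P(D = cherry | hi), assumed

def posterior(priors, num_limes):
    """P(hi | d) after observing num_limes lime candies (i.i.d. given hi)."""
    unnorm = [p * (1 - pc) ** num_limes for p, pc in zip(priors, p_cherry)]
    alpha = 1.0 / sum(unnorm)             # normalization constant
    return [alpha * u for u in unnorm]

post = posterior(priors, 10)
print([round(p, 4) for p in post])        # mass shifts toward h5 (all lime)
# Predictive probability that the next candy is lime:
print(sum(p * (1 - pc) for p, pc in zip(post, p_cherry)))
```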
12.2 Learning with Complete Data
The maximum-likelihood hypothesis hML asserts that the actual proportion of cherries in
the bag is equal to the observed proportion in the candies unwrapped so far.
The standard method for maximum-likelihood parameter learning is to:
1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.
Another example: suppose this new candy manufacturer wants to give a little hint to the
consumer and uses candy wrappers colored red and green. The wrapper for each candy is selected
probabilistically, according to some unknown conditional distribution depending on the flavor.
This model has three parameters: θ, θ1, and θ2.
P(Flavor = cherry, Wrapper = green | hθ,θ1,θ2)
= P(Flavor = cherry | hθ,θ1,θ2) P(Wrapper = green | Flavor = cherry, hθ,θ1,θ2)
= θ · (1 - θ1)
Suppose we unwrap N candies, of which c are cherries and l are limes; rc of the cherries have red
wrappers and gc have green, while rl of the limes have red and gl have green.
The likelihood of the data is given by
P(d | hθ,θ1,θ2) = θ^c (1 - θ)^l · θ1^rc (1 - θ1)^gc · θ2^rl (1 - θ2)^gl
Taking logarithms,
L = [c log θ + l log(1 - θ)] + [rc log θ1 + gc log(1 - θ1)] + [rl log θ2 + gl log(1 - θ2)].
Setting the derivatives with respect to each parameter to zero gives
θ = c/(c + l), θ1 = rc/(rc + gc), θ2 = rl/(rl + gl).
With complete data, the maximum-likelihood parameter learning problem for a Bayesian
network decomposes into separate learning problems, one for each parameter
The parameters of this model are the mean μ and the standard deviation σ.
Let the observed values be x1, . . . , xN. Then the log likelihood is
L = Σj log [ (1/(σ√(2π))) e^(-(xj - μ)²/(2σ²)) ] = N(-log √(2π) - log σ) - Σj (xj - μ)²/(2σ²).
Setting the derivatives to zero gives the familiar results
μ = (Σj xj)/N and σ = √( (Σj (xj - μ)²)/N ).
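A minimal check of these closed-form maximum-likelihood estimates; the observations are made-up illustrative values.

```python
import math

# ML estimates for a Gaussian: mu = sum(xj)/N, sigma = sqrt(sum((xj-mu)^2)/N).
xs = [4.9, 5.1, 5.3, 4.7, 5.0, 5.2]   # illustrative observations

N = len(xs)
mu = sum(xs) / N
sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / N)
print(mu, sigma)
```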
The beta family has a useful property: if Θ has a prior beta[a, b], then, after a data point is observed, the
posterior distribution for Θ is also a beta distribution; in other words, beta is closed under update.
The beta family is called the conjugate prior for the family of distributions for a Boolean variable.
Density estimation can also be done nonparametrically, with kernels: P(x) = (1/N) Σj K(x, xj),
where K is (for example) a Gaussian kernel of width w, d is the number of dimensions in x, and D is
the Euclidean distance function. A good value of w can be chosen by using cross-validation.
12.3 Learning with Hidden Variables: The EM Algorithm
Figure: (a) A simple diagnostic network for heart disease, which is assumed to be a hidden variable.
Each variable has three possible values and is labeled with the number of independent parameters in
its conditional distribution; the total number is 78. (b) The equivalent network with HeartDisease
removed. Note that the symptom variables are no longer conditionally independent given their
parents. This network requires 708 parameters.
Figure: (a) A Gaussian mixture model with three components; the weights (left-to-right) are 0.2, 0.3, and
0.5. (b) 500 data points sampled from the model in (a). (c) The model reconstructed by EM from
the data in (b).
The parameters of a mixture of Gaussians are:
wi = P(C = i) (the weight of each component)
μi (the mean of each component)
Σi (the covariance of each component)
The basic idea of EM in this context is to pretend that we know the parameters of the model
and then to infer the probability that each data point belongs to each component. After that, we
refit the components to the data, where each component is fitted to the entire data set with each
point weighted by the probability that it belongs to that component. The process iterates until
convergence.
For the mixture of Gaussians, we initialize the mixture-model parameters arbitrarily and
then iterate the following two steps:
E-step: Compute the probabilities pij = P(C = i | xj), the probability that datum xj was
generated by component i. By Bayes' rule, we have pij = αP(xj | C = i)P(C = i). Let ni = Σj pij, the
effective number of data points currently assigned to component i.
M-step: Compute the new means, covariances, and component weights:
μi ← Σj pij xj / ni
Σi ← Σj pij (xj - μi)(xj - μi)ᵀ / ni
wi ← ni / N
where N is the total number of data points.
There are two points to notice. First, the log likelihood for the final learned model slightly exceeds
that of the original model, from which the data were generated.
The second point is that EM increases the log likelihood of the data at every iteration.
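A minimal one-dimensional, two-component EM sketch matching the E-step and M-step above (scalar variances stand in for covariance matrices). The data and the initial parameter guesses are illustrative assumptions.

```python
import math
import random

random.seed(0)
# Data: two illustrative Gaussian clusters around 0 and 5.
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(5.0, 1.0) for _ in range(300)])

w = [0.5, 0.5]      # component weights wi = P(C = i), arbitrary initialization
mu = [-1.0, 1.0]    # component means, arbitrary initialization
var = [1.0, 1.0]    # component variances, arbitrary initialization

def pdf(x, m, v):
    """Gaussian density N(x; m, v)."""
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

for _ in range(50):
    # E-step: responsibilities pij = alpha * P(xj | C = i) * P(C = i)
    resp = []
    for x in data:
        row = [w[i] * pdf(x, mu[i], var[i]) for i in range(2)]
        s = sum(row)
        resp.append([r / s for r in row])
    n = [sum(r[i] for r in resp) for i in range(2)]   # effective counts ni
    # M-step: refit each component, weighting each point by pij
    for i in range(2):
        mu[i] = sum(r[i] * x for r, x in zip(resp, data)) / n[i]
        var[i] = sum(r[i] * (x - mu[i]) ** 2 for r, x in zip(resp, data)) / n[i]
        w[i] = n[i] / len(data)

print(w)   # should approach ~[0.4, 0.6]
print(mu)  # means should approach ~0 and ~5
```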
Figure: (a) A mixture model for candy. The proportions of different flavors, wrappers, and presence of holes
depend on the bag, which is not observed. (b) Bayesian network for a Gaussian mixture. The mean and
covariance of the observable variables X depend on the component C.
the parameter updates for Bayesian network learning with hidden variables are directly available from
the results of inference on each example. Moreover, only local posterior probabilities are needed for each
parameter.
The expected counts are computed by an HMM inference algorithm. The forward–backward algorithm
can be modified very easily to compute the necessary probabilities; the probabilities required are obtained
by smoothing rather than filtering.
13. Reinforcement Learning
Markov decision process (MDP) recap:
● States s ∈ S, actions a ∈ A
● Model T(s, a, s′) ≡ P(s′ | s, a) = probability that taking action a in s leads to s′
Regular MDP
– Given:
• Transition model P(s’ | s, a)
• Reward function R(s)
– Find:
• Policy π(s)
• Reinforcement learning
– Transition model and reward function initially unknown
– Still need to find the right policy
– “Learn by doing”
Reinforcement learning strategies
• Model-based
– Learn the model of the MDP (transition probabilities and rewards) and try to
solve the MDP concurrently
• Model-free
– Learn how to act without explicitly learning the transition probabilities P(s’ | s, a)
– Q-learning: learn an action-utility function Q(s,a) that tells us the value of doing
action a in state s
Model-based reinforcement learning
• Basic idea: try to learn the model of the MDP (transition probabilities and rewards) and
learn how to act (solve the MDP) simultaneously
• Learning the model:
– Keep track of how many times state s’ follows state s when you take action a and
update the transition probability P(s’ | s, a) according to the relative frequencies
– Keep track of the rewards R(s)
• Learning how to act:
– Estimate the utilities U(s) using Bellman’s equations
– Choose the action that maximizes expected future utility:
π*(s) = arg max_{a ∈ A(s)} Σ_{s′} P(s′ | s, a) U(s′)
Exploration vs. exploitation
• Exploration: take a new action with unknown consequences
explore more in the beginning, become more and more greedy over time
Standard (“greedy”) selection of optimal action:
a = arg max_{a′ ∈ A(s)} Σ_{s′} P(s′ | s, a′) U(s′)
Modified Strategy: use an exploration function f
a = arg max_{a′ ∈ A(s)} f( Σ_{s′} P(s′ | s, a′) U(s′), N(s, a′) )
f(u, n) = R+ if n < Ne
        = u otherwise
(R+ is an optimistic estimate of the best possible reward, and Ne is a fixed visit-count threshold.)
– Pros:
• Get a more accurate model of the environment
• Discover higher-reward states than the ones found so far
– Cons:
• When you’re exploring, you’re not maximizing your utility
• Something bad might happen
• Exploitation: go with the best strategy found so far
– Pros:
• Maximize reward as reflected in the current utility estimates
• Avoid bad stuff
– Cons:
• Might also prevent you from discovering the true optimal strategy
Q-learning
• At each time step t:
• From current state s, select an action a:
a = arg max_{a′} f(Q(s, a′), N(s, a′))
– Get the successor state s′
– Perform the TD update:
Q(s, a) ← Q(s, a) + α [ R(s) + γ max_{a′} Q(s′, a′) - Q(s, a) ]
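A minimal tabular Q-learning sketch on a hypothetical 1-D corridor: states 0..4, actions -1/+1, and reward 1 only at the terminal goal state 4. Epsilon-greedy action selection stands in for the exploration function f; the environment, learning rate, and discount are illustrative assumptions.

```python
import random

random.seed(0)
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s = 0
    while s != GOAL:
        # Select an action: explore with probability epsilon, else greedy.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: Q[(s, a_)])
        s2 = min(max(s + a, 0), N_STATES - 1)     # successor state
        r = 1.0 if s2 == GOAL else 0.0
        # TD update: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
        best_next = 0.0 if s2 == GOAL else max(Q[(s2, a_)] for a_ in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# Greedy policy learned for each non-terminal state (should be +1 everywhere).
print({s: max(ACTIONS, key=lambda a_: Q[(s, a_)]) for s in range(N_STATES - 1)})
```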