Machine Learning-5
B.Tech.
Books
Overview of supervised learning T1/Ch. 2
K-nearest neighbour T1/Ch. 2.3.2, R2/Ch.8.2
Multiple linear regression T1/Ch. 3.2.3
Shrinkage methods (Ridge regression, Lasso regression) T1/Ch. 3.4
Logistic regression T1/Ch. 4.3
Linear Discriminant Analysis T1/Ch. 4.4
Feature selection T1/Ch. 5.3
●
T1: T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning – Data Mining, Inference and Prediction, 2nd Edition, Springer
Module II
Model Accuracy
M1 60%
M2 62%
M3 55%
●
Boosting is a method of combining weak models M1, M2 and M3 so as to yield a
higher-accuracy model (say, 80%).
Boosting
●
Consider a two-class problem with the output variable coded as Y ∈ {−1, 1}.
●
Given a data set D = {(xi, yi)}, i = 1, 2, ..., N, of training patterns,
●
A classifier G(x) produces a prediction taking one of the two values {−1, 1}.
●
Then the error rate on the training sample is err = (1/N) Σi I(yi ≠ G(xi)).
●
The purpose of boosting is to sequentially apply the weak classification algorithm
to repeatedly modified versions of the data, thereby producing a sequence of weak
classifiers Gm(x), m = 1, 2, ..., M.
Boosting
●
The predictions from all of them are then combined through a weighted majority
vote to produce the final prediction: G(x) = sign(Σm αm Gm(x)).
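A minimal sketch of this scheme (an AdaBoost.M1-style procedure, assuming labels coded in {−1, +1} and scikit-learn decision stumps as the weak learners; the function names are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    # y must be coded in {-1, +1}
    N = len(y)
    w = np.full(N, 1.0 / N)                       # observation weights, start uniform
    stumps, alphas = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # weak learner on re-weighted data
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w) # weighted training error rate
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err)           # weight of this weak classifier
        w *= np.exp(alpha * (pred != y))          # up-weight misclassified points
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Weighted majority vote: G(x) = sign(sum_m alpha_m * G_m(x))
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)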
●
In chess, the game player is the decision maker (Agent) and the environment is the
board.
●
At any time, the environment is in a certain state that is one of a set of possible
states—for example, the state of the board in chess playing game.
●
The decision maker (Agent) has a set of possible actions, for example, the legal
movements of pieces on the chess board.
What is Reinforcement Learning?
●
Once an action is chosen and taken, the state changes.
●
The solution to the task requires a sequence of actions, and we rarely get feedback in the
form of a reward, generally only when the complete sequence is carried out.
●
The learning agent learns the best sequence of actions to solve a problem
– where “best” is quantified as the sequence of actions that has the maximum
cumulative reward.
●
Such is the setting of reinforcement learning.
●
It is learning to make good decisions under uncertainty.
Challenges in Reinforcement Learning
●
Simulating and setting up the environment
●
Scaling and tweaking the neural network controlling the agent
●
There is no mode of communication with the network other than rewards
and penalties.
Difference between ML, DL, and RL
●
ML is a form of AI in which computers are given the ability to
progressively (in a supervised or unsupervised manner) improve their performance on a specific
task with data.
●
DL models consist of multiple neural network layers that gradually learn more
abstract features from the data.
●
RL employs a system of rewards and penalties and compels the agent to solve a
problem by itself.
●
Fundamental challenge in AI and ML is learning to make good decisions under
uncertainty.
Reinforcement Learning Involves
●
Optimization
●
Delayed consequences
●
Exploration
●
Generalization
Optimization
●
Goal is to find an optimal way to make decisions
– Yielding best outcomes or at least very good outcomes
●
Explicit notion of utility of decisions
●
Example: finding the minimum-distance route between two cities, given a network of
roads
Delayed Consequences
●
Decisions now can impact things much later...
– Saving for retirement
– Health insurance
●
Introduces two challenges
– When planning: decisions involve reasoning about not just immediate benefit
of a decision but also its longer term ramifications
– When learning: temporal credit assignment is hard (what caused later high or
low rewards?)
– Temporal credit assignment problem – how does the agent figure out the causal
relationship between the decisions it made in the past and the outcomes observed in
the future?
Exploration
●
Learning about the world by making decisions
– Agent as scientist
– Learn to ride a bike by trying (and failing)
●
Decisions impact what we learn about
– If you choose to go to IIT instead of SIT, you will have different later
experiences...
Generalization
●
A policy is a mapping from past experience to action.
●
Why not just pre-program a policy?
RL vs Other AI and Machine Learning
Imitation Learning
●
Use experience to guide future decisions.
●
Sequential decision making under uncertainty
Modeling Real-World Problem
●
Let S : the set of possible states of the model
st : the state at time t (giving the sequence of states over time)
A : the set of possible actions
●
Agent's representation of how the world changes given the agent's action
●
Transition / dynamics model predicts next agent state
P(st+1 = s’|st = s, at = a)
●
Reward model predicts immediate reward
r(st = s, at = a) = E(rt|st = s, at = a)
Example: Mars Rover Stochastic Markov Model
Policy
●
It is a function π : S → A that maps the agent's states to actions, i.e. it determines which
action should be taken in state s.
●
Value function Vπ(s): it is the cumulative sum of (discounted) future rewards obtained by the agent starting from the
state "s" and following policy "π".
●
γ: it is the discount factor
●
Can be used to quantify goodness/badness of states and actions
Example: Mars Rover Value Function
Markov Process
●
A Markov process is a stochastic process that satisfies the Markov property,
because of which we say that a Markov process is memoryless.
●
Assumptions:
– Finite state space : the state space of the Markov process is finite. This means
that for the Markov process (s0, s1, s2, . . .), there is a state space S with |S| < ∞
– Stationary transition probabilities : The transition probabilities are time
independent. Mathematically, this means the following:
P(si = s’|si-1 = s) = P(sj = s’|sj-1 = s) , ∀ s, s’ ∈ S , ∀ i, j = 1, 2, . . . .
●
A Markov process satisfying these assumptions is called a "Markov Chain".
Markov Process
●
A Markov process is defined by the tuple (S, P) where
– S : a finite state space
– P : the transition probability P(st+1 = s'|st = s)
The matrix P is a non-negative row-stochastic matrix, i.e. each row sums to 1
– If there is a finite number (N) of states, P can be expressed as an N × N matrix
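As an illustration, a short Python sketch of sampling a trajectory from such a chain (the toy matrix below is hypothetical, not the Mars rover chain):

import numpy as np

def sample_chain(P, start_state, steps, rng=None):
    # P is a row-stochastic matrix: P[s, s'] = P(s_{t+1} = s' | s_t = s)
    rng = rng or np.random.default_rng()
    n_states = P.shape[0]
    trajectory = [start_state]
    s = start_state
    for _ in range(steps):
        s = rng.choice(n_states, p=P[s])   # draw the next state from row s of P
        trajectory.append(s)
    return trajectory

# Toy 3-state chain (each row sums to 1)
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
print(sample_chain(P, start_state=0, steps=5))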
Example: Mars Rover Markov Chain Transition Matrix, P
MRP:Markov Reward Process
●
Example: Mars Rover Markov Chain Episodes
– Suppose the Mars rover is at s4
– Example: Sample episodes starting from s4
– s4,s5,s6,s7,s7,s7, . . .
– s4,s3,s2,s1,s2,s3,s4,s5 . . .
– s4,s4,s5,s6,s6 . . .
MRP:Markov Reward Process
●
Markov Reward Process is a Markov Chain + rewards
●
A Markov Reward Process (MRP) is defined as a 4-tuple (S, P, R, γ) where
– S is a (finite) set of states (s ∈ S)
– P is the dynamics/transition probability that specifies P(st+1 = s'|st = s)
– R is a reward function that maps states to rewards (real numbers),
i.e. R : S → R
– γ is the discount factor, γ ∈ [0, 1]
MRP:Markov Reward Process
●
Reward function:
– In a Markov reward process, whenever a transition happens from a current
state s to a successor state s’ , a reward is obtained depending on the current
state s.
– Thus, for the Markov process (s0, s1, s2, . . .), each transition si → si+1 is
accompanied by a reward ri, for all i = 0, 1, . . .
– A particular episode of the Markov reward process is represented as (s0, r0, s1,
r1, s2, r2, . . .)
MRP:Markov Reward Process
●
Reward function:
– Rewards can be either deterministic or stochastic.
– In the deterministic case, mathematically this means that for all realizations of
the process we must have that: ri = rj , whenever si = sj ∀ i, j = 0, 1, . . . ,
– Example: +1 in s1
+10 in s7
0 in others
MRP:Markov Reward Process
●
Horizon(H) :
– The horizon H of a Markov reward process is defined as the number of time
steps in each episode (realization) of the process.
– The horizon can be finite or infinite.
– If the horizon is finite, then the process is also called a finite Markov reward
process.
●
Return (Gt):
– The return Gt of a Markov reward process is defined as the discounted sum of
rewards starting at time t up to the horizon H.
– It is given by the following mathematical formula:
Gt = rt + γ rt+1 + γ² rt+2 + · · · + γ^(H−1−t) rH−1
MRP:Markov Reward Process
●
State value function (Vt(s)):
– The state value function Vt(s) for a Markov reward process and a state s ∈ S is
defined as the expected return starting from state s at time t, and is given by the
following expression:
– Vt(s) = E[Gt | st = s]
●
Discount factor(γ):
– If the horizon is infinite and γ = 1, then the return can become infinite even if
the rewards are all bounded.
– If this happens, then the value function V(s) can also become infinite.
– Such problems cannot then be solved using a computer.
– To avoid such mathematical difficulties and make the problems computationally
tractable, we set γ < 1. This quantity γ is called the discount factor.
MRP:Markov Reward Process
●
Discount factor(γ):
– Other than for purely computational reasons, it should be noted that humans
behave in much the same way:
●
we tend to put more importance on immediate rewards than on rewards
obtained at a later time.
– When γ = 0, we only care about the immediate reward.
– When γ = 1, we put as much importance on future rewards as on the
present.
– If the horizon of the Markov reward process is finite, i.e. H < ∞, then we can set
γ = 1, as the returns and value functions are always finite.
MRP:Markov Reward Process
●
Example of a MRP: Mars Rover
– Given the set of states S = {S1, S2, S3, S4, S5, S6, S7}, discount factor γ = 0.5, rewards
r1 = 1, r2 = r3 = r4 = r5 = r6 = 0, r7 = 10, and horizon H = 4, compute the return G0 of
the following episodes.
– S4, S5, S6, S7, S7 : G0 = 0 + 0.5·0 + 0.5²·0 + 0.5³·10 = 1.25
– S4, S4, S5, S4, S5 : G0 = 0 + 0.5·0 + 0.5²·0 + 0.5³·0 = 0
– S4, S3, S2, S1, S2 : G0 = 0 + 0.5·0 + 0.5²·0 + 0.5³·1 = 0.125
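These returns can be checked with a few lines of Python (the reward list below is simply the per-step rewards of the first episode):

def episode_return(rewards, gamma):
    # G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards along S4, S5, S6, S7 (r7 = 10, all others 0), gamma = 0.5
print(episode_return([0, 0, 0, 10], gamma=0.5))   # 1.25, matching the first episode above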
How to compute the value function of a MRP?
●
There are three different ways to compute the value function of a Markov reward
process:
– Simulation
– Analytic solution
– Iterative solution
How to compute the value function of a MRP?
●
Monte Carlo simulation:
– Generate a large number (N) of episodes starting from state "s" at time "t".
– Compute the return Gt (discounting rewards by powers of γ) for each episode.
– Estimate the value as the mean: Vt(s) ≈ ΣGt / N
– For a Markov reward process M = (S, P, R, γ), state "s", time "t", and the number of
simulation episodes N, the simulation procedure is outlined below:
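A minimal Python rendering of this Monte Carlo estimate, assuming the MRP is represented by a row-stochastic transition matrix P and a per-state reward vector R (illustrative names):

import numpy as np

def mc_value_estimate(P, R, gamma, H, s, N=10000, rng=None):
    # Monte Carlo estimate of V_t(s): average the return over N sampled episodes.
    rng = rng or np.random.default_rng()
    n_states = P.shape[0]
    total = 0.0
    for _ in range(N):
        state, G = s, 0.0
        for k in range(H):
            G += (gamma ** k) * R[state]             # reward depends on the current state
            state = rng.choice(n_states, p=P[state]) # sample the next state
        total += G
    return total / N                                 # mean return approximates V(s)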
How to compute the value function of a MRP?
●
Iterative solution for finite horizon:
– Dynamic programming based solution
–
– For both these algorithms , the computational cost of each loop is O(|S|2)
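A compact Python sketch of this finite-horizon backup, using the same matrix P and reward vector R assumed above:

import numpy as np

def mrp_value_iterative(P, R, gamma, H):
    # Finite-horizon value function via dynamic programming backups.
    # Each sweep over the states costs O(|S|^2), as noted above.
    V = np.zeros(P.shape[0])          # value is 0 at the horizon
    for _ in range(H):
        V = R + gamma * P @ V         # backup: V(s) = R(s) + gamma * sum_s' P(s'|s) V(s')
    return V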
Markov Decision Process(MDP)
●
It is formally represented using the tuple (S, A, P, R, γ) which are listed below:
– S : A finite state space.
– A : A finite set of actions which are available from each state
– P : A transition probability model that specifies P(s'|s, a).
– R : A reward function that maps a state-action pair to rewards (real numbers),
i.e. R : S×A → R.
– γ : Discount factor γ ∈ [0, 1] .
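One simple way to hold this tuple in code (the field layout below is just an illustrative convention):

from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    # Container for the (S, A, P, R, gamma) tuple
    n_states: int        # |S|
    n_actions: int       # |A|
    P: np.ndarray        # shape (A, S, S): P[a, s, s'] = P(s'|s, a)
    R: np.ndarray        # shape (S, A):    R[s, a] = expected immediate reward
    gamma: float         # discount factor in [0, 1]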
Multi-Armed Bandit
●
It is a classic reinforcement learning problem which exemplifies the exploration-
exploitation tradeoff dilemma.
●
K-armed bandit / N-armed bandit
●
It is a problem in which a limited set of resources must be allocated among
competing choices in a way that maximizes their expected gain, when each
choice's properties are only partially known at the time of allocation and may
become better understood as time passes or as resources are allocated to the choice.
●
It is also categorized as stochastic scheduling.
Multi-Armed Bandit - Application
●
How must a given budget be distributed among research departments (whose
outcomes are only partially known) to maximize results?
●
How to design an effective financial portfolio?
●
How to design adaptive routing to minimize end-to-end delay, delay jitter,
propagation delay, etc.?
●
How to execute effective investigation of clinical trials of different experimental
treatments while minimizing patient losses?
●
How to balance EXPLOITATION vs EXPLORATION tradeoff?
Modeling Multi-Armed Bandit Model (MAB)
●
It is a set of real reward distributions:
– B = {R1, R2, ..., Rk}, one distribution for each of the k ∈ N+ levers
– μ1, μ2, μ3, ..., μk : the mean values associated with these reward distributions.
●
The gambler iteratively plays one lever per round and observes the associated
reward.
●
Objective : Maximize the sum of the collected rewards.
●
Horizon(H): The number of rounds that remain to be played.
Modeling Multi-Armed Bandit Model (MAB)
●
MAB ≡ One-state Markov Decision Process
●
ρ : the regret after T rounds, defined as ρ = T μ* − Σt=1..T r̂t
●
where μ* = maxk {μk} is the maximal reward mean
●
r̂t : the reward in round t.
●
Definition (Zero-Regret Strategy): it is a strategy whose average regret per round,
ρ/T, tends to zero as T → ∞, with probability 1.
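A tiny Python helper for this quantity (the arm means and collected rewards below are hypothetical inputs):

def regret(mu, collected_rewards):
    # rho = T * mu_star - sum of the rewards actually collected over T rounds
    mu_star = max(mu)                    # maximal reward mean over the arms
    T = len(collected_rewards)
    return T * mu_star - sum(collected_rewards)

print(regret(mu=[0.2, 0.5, 0.7], collected_rewards=[1, 0, 1, 1, 0]))   # 0.7*5 - 3 = 0.5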
Bandit Strategies
●
1. Optimal solutions:
UCBC (Upper Confidence Bounds with Clusters)
– The algorithm incorporates the clustering information by playing at two levels:
●
First picking a cluster using a UCB-like strategy at each time step
●
Subsequently picking an arm within the cluster, again using a UCB-like
strategy
Bandit Strategies
●
2. Approximate solutions:
– Epsilon-greedy strategy
– Epsilon-first strategy
– Epsilon-decreasing strategy
– Adaptive epsilon-greedy strategy based on value differences (VDBE)
– Adaptive epsilon-greedy strategy based on Bayesian ensembles (Epsilon-BMC)
– Contextual epsilon-greedy strategy
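A minimal sketch of the basic epsilon-greedy strategy from this family (the pull function and arm means are hypothetical placeholders):

import numpy as np

def epsilon_greedy(pull, K, T, epsilon=0.1, rng=None):
    # pull(k) returns a sampled reward for arm k; K arms, T rounds.
    rng = rng or np.random.default_rng()
    counts = np.zeros(K)
    values = np.zeros(K)                           # running mean reward per arm
    total = 0.0
    for _ in range(T):
        if rng.random() < epsilon:
            k = int(rng.integers(K))               # explore: pick a random arm
        else:
            k = int(np.argmax(values))             # exploit: pick the best arm so far
        r = pull(k)
        counts[k] += 1
        values[k] += (r - values[k]) / counts[k]   # incremental mean update
        total += r
    return values, total

# Example with three Bernoulli arms of (unknown) success probabilities
rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]
values, total = epsilon_greedy(lambda k: float(rng.random() < true_means[k]), K=3, T=1000)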
Contextual Bandit
●
It is a generalization of the multi-armed bandit.
●
At each iteration an agent has to choose between arms by referring to a d-
dimensional (context) feature vector.
●
Approximate solutions are
– Online linear Bandits
●
LinUCB (Linear Upper Confidence Bound)
●
LinRel(Linear Associative Reinforcement Learning)
– Online non-Linear Bandits
●
UCB...
●
Generalized Linear algorithm
●
Neural Bandit algorithm
●
Kernel UCB algorithm
●
Bandit forest algorithm
●
Oracle-based algorithm
Adversarial Bandit
●
Another variant of the multi-armed bandit problem is called the adversarial bandit.
●
Working Principle:
●
At each iteration, an agent chooses an arm and an adversary simultaneously
chooses the payoff structure for each arm.
●
Example (iterated prisoner's dilemma):
– Each adversary has two arms to pull.
– They can either deny or confess.
– Standard stochastic bandit algorithms do not work very well with these
iterations.
Adversarial Bandit
– Example: if the opponent cooperates in the first 10 rounds, defects for the
next 20, then cooperates in the following 30, etc.
– After a certain point, sub-optimal arms are rarely pulled, to limit exploration
and focus on exploitation.
– When the environment changes, the algorithm is unable to adapt or may not
even detect the change.
●
Note: it is one of the strongest generalizations of the bandit problem, as it removes
all assumptions about the reward distribution, and a solution to the adversarial bandit
problem is a generalized solution to the more specific bandit problems.
Pseudocode: Approximate solutions of the Adversarial Bandit
●
Input: γ∈(0,1], V∈R
●
Initialization: wi(1) = 1 ∀ i=1 to k
●
For each t = 1, 2, 3, ..., T
1. Set pi(t) = (1 − γ) · wi(t) / Σj wj(t) + γ/k , i = 1 to k
2. Draw it randomly according to the probabilities p1(t), p2(t), ..., pk(t)
3. Receive reward xit(t) ∈ (0, 1]
4. For j = 1, 2, 3, ..., k:
x̂j(t) = xj(t)/pj(t) , if j = it
x̂j(t) = 0 , otherwise
wj(t+1) = wj(t) · exp(γ · x̂j(t) / k)
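A runnable Python sketch of this Exp3-style procedure (reward_fn is a hypothetical environment callback returning payoffs in (0, 1]):

import numpy as np

def exp3(reward_fn, K, T, gamma=0.1, rng=None):
    rng = rng or np.random.default_rng()
    w = np.ones(K)                                   # w_i(1) = 1 for every arm
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / K    # mix weights with uniform exploration
        i_t = int(rng.choice(K, p=p))                # draw an arm according to p
        x = reward_fn(i_t, t)                        # observed reward for the chosen arm
        x_hat = np.zeros(K)
        x_hat[i_t] = x / p[i_t]                      # importance-weighted reward estimate
        w *= np.exp(gamma * x_hat / K)               # exponential weight update
    return w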
Pseudocode: Approximate solutions of the Adversarial Bandit
●
Explanation:
– The algorithm chooses an arm uniformly at random with probability γ (Exploration);
with probability (1 − γ) it prefers arms with higher weights (Exploitation).
– Exp3 maintains weights for each arm to calculate the pulling probability, whereas
FPL (Follow the Perturbed Leader) does not need to know the pulling probability for each arm.
– Exp3 is computationally expensive (a calculation for each arm), whereas FPL is
computationally efficient.