Machine Learning Unit 4

Markov Decision Process

Reinforcement Learning: Reinforcement Learning is a type of Machine Learning. It allows machines and software agents to automatically determine the ideal behaviour within a specific context, in order to maximize performance. Simple reward feedback is required for the agent to learn its behaviour; this is known as the reinforcement signal. There are many different algorithms that tackle this issue. In fact, Reinforcement Learning is defined by a specific type of problem, and all its solutions are classed as Reinforcement Learning algorithms. In this problem, an agent must decide the best action to select based on its current state. When this step is repeated, the problem is known as a Markov Decision Process.

A Markov Decision Process (MDP) model contains:
* A set of possible world states S.
* A set of models (transition models).
* A set of possible actions A.
* A real-valued reward function R(s, a).
* A policy, the solution of the Markov Decision Process.

In summary: Model T(S, a, S') ~ P(S'|S, a); Actions A(S), A; Reward R(S), R(S, a), R(S, a, S').

What is a State?
A State is a set of tokens that represent every state that the agent can be in.

What is a Model?
A Model (sometimes called a Transition Model) gives an action's effect in a state. In particular, T(S, a, S') defines a transition T where being in state S and taking action 'a' takes us to state S' (S and S' may be the same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S'|S, a), which represents the probability of reaching state S' if action 'a' is taken in state S. Note that the Markov property states that the effects of an action taken in a state depend only on that state and not on the prior history.

What are Actions?
An Action A is a set of all possible actions. A(S) defines the set of actions that can be taken while in state S.

What is a Reward?
A Reward is a real-valued reward function. R(S) indicates the reward for simply being in state S. R(S, a) indicates the reward for being in state S and taking action 'a'. R(S, a, S') indicates the reward for being in state S, taking action 'a' and ending up in state S'.

What is a Policy?
A Policy is a solution to the Markov Decision Process. A policy is a mapping from S to a. It indicates the action 'a' to be taken while in state S.

Let us take the example of a grid world.

[Figure: a 3x4 grid world with START at (1,1), the Blue Diamond at (4,3), the Fire grid at (4,2) and a blocked cell at (2,2).]

An agent lives in the grid. The above example is a 3x4 grid. The grid has a START state (grid no 1,1). The purpose of the agent is to wander around the grid and finally reach the Blue Diamond (grid no 4,3). Under all circumstances, the agent should avoid the Fire grid (orange colour, grid no 4,2). Also, grid no 2,2 is a blocked grid; it acts as a wall, hence the agent cannot enter it.

The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT. Walls block the agent's path, i.e., if there is a wall in the direction the agent would have moved, the agent stays in the same place. So, for example, if the agent says LEFT in the START grid, it stays put in the START grid.

First aim: find the shortest sequence getting from START to the Diamond. Two such sequences can be found:
* RIGHT RIGHT UP UP RIGHT
* UP UP RIGHT RIGHT RIGHT

Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion. The move is now noisy: 80% of the time the intended action works correctly; 20% of the time the action the agent takes causes it to move at right angles to the intended direction. For example, if the agent says UP, the probability of going UP is 0.8, whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP).
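To make the transition model concrete, here is a minimal Python sketch of the noisy grid-world dynamics described above. The coordinate convention (column, row), the dictionary-based representation and all names are illustrative choices, not part of the original example.

# Minimal sketch of the noisy 3x4 grid-world transition model described above.
# Coordinates are (col, row); START (1,1), Diamond (4,3), Fire (4,2), wall at (2,2).

ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
PERPENDICULAR = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                 "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
BLOCKED = {(2, 2)}

def move(state, action):
    """Deterministic effect of one action; walls and the blocked cell keep the agent in place."""
    col, row = state[0] + ACTIONS[action][0], state[1] + ACTIONS[action][1]
    if not (1 <= col <= 4 and 1 <= row <= 3) or (col, row) in BLOCKED:
        return state
    return (col, row)

def transition(state, action):
    """T(S, a, S') as a dict {S': P(S'|S, a)}: 0.8 intended move, 0.1 each right-angle slip."""
    probs = {}
    for a, p in [(action, 0.8), (PERPENDICULAR[action][0], 0.1), (PERPENDICULAR[action][1], 0.1)]:
        s_next = move(state, a)
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs

print(transition((1, 1), "UP"))   # {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}

In the printed example, attempting UP from START succeeds with probability 0.8, the LEFT slip runs into the wall and leaves the agent in place, and the RIGHT slip moves it to (2,1).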
The agent receives rewards at each time step:
* A small reward each step (it can be negative, in which case it can also be termed a punishment; in the above example, entering the Fire grid can have a reward of -1).
* Big rewards come at the end (good or bad).
* The goal is to maximize the sum of rewards.

Bellman Equation

According to the Bellman Equation, the long-term reward for a given action is equal to the reward from the current action combined with the expected reward from the future actions taken at the following time steps. Let us take an example: here we have a maze, which is our environment, and the sole goal of our agent is to reach the trophy state (R = 1), i.e. a good reward, and to avoid the fire state, because that would be a failure (R = -1), i.e. a bad reward.

Initially, we give our agent some time to explore the environment and let it figure out a path to the goal. As soon as it reaches its goal, it backtracks its steps to its starting position and marks the values of all the states which eventually lead towards the goal as V = 1. The agent faces no problem until we change its starting position: then it cannot find a path towards the trophy state, since the value of all the marked states is equal to 1. To solve this problem, we use the Bellman Equation:

  V(s) = max_a ( R(s, a) + γ V(s') )

* State (s): the current state of the agent in the environment.
* Next state (s'): after taking action a at state s, the agent reaches s'.
* Value (V): a numeric representation of a state which helps the agent find its path. V(s) here means the value of state s.
* Reward (R): the treat which the agent gets after performing an action a.
  + R(s): reward for being in state s.
  + R(s, a): reward for being in state s and performing action a.
  + R(s, a, s'): reward for being in state s, taking action a and ending up in s'.
  + e.g. a good reward can be +1, a bad reward can be -1, no reward can be 0.
* Action (a): the set of possible actions that can be taken by the agent in state s, e.g. (LEFT, RIGHT, UP, DOWN).
* Discount factor (γ): determines how much the agent cares about rewards in the distant future relative to those in the immediate future. It has a value between 0 and 1. A lower value encourages short-term rewards, while a higher value promises long-term reward.

Dynamic Programming

* Main idea:
  - use value functions to structure the search for good policies;
  - requires a perfect model of the environment.
* Two main components:
  - policy evaluation: compute V^π from π;
  - policy improvement: improve π based on V^π;
  - start with an arbitrary policy and repeat evaluation/improvement until convergence.

Policy evaluation / improvement

* Policy evaluation: π -> V^π. The Bellman equations define a system of n equations (one per state). We could solve it directly, but instead we use the iterative version:

  V_{k+1}(s) = Σ_a π(s, a) Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V_k(s') ]

  Start with an arbitrary value function V_0 and iterate until V_k converges.
* Policy improvement: V^π -> π'.

  π'(s) = argmax_a Q^π(s, a) = argmax_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V^π(s') ]

  π' is either strictly better than π, or π' is optimal (if π = π').
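The two steps above can be written down directly. The following is a minimal sketch, assuming the MDP is supplied as a list of states, a list of actions, a function transition(s, a) returning {s': P(s'|s, a)} and a function reward(s, a, s2) (for example, the grid world sketched earlier); the policy is taken to be deterministic, so the sum over a in the evaluation update collapses to the single chosen action. All names are illustrative.

# Minimal sketch of iterative policy evaluation and greedy policy improvement.
# `states`, `actions`, `transition(s, a)` -> {s': P(s'|s,a)} and `reward(s, a, s2)`
# are assumed to be supplied by the MDP definition.

def policy_evaluation(policy, states, transition, reward, gamma=0.9, theta=1e-6):
    """Iterate V_{k+1}(s) = sum_{s'} P(s'|s, pi(s)) [ R(s, pi(s), s') + gamma * V_k(s') ]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v_new = sum(p * (reward(s, a, s2) + gamma * V[s2])
                        for s2, p in transition(s, a).items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:          # stop once the value function has converged
            return V

def policy_improvement(V, states, actions, transition, reward, gamma=0.9):
    """pi'(s) = argmax_a sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * V(s') ]."""
    def q(s, a):
        return sum(p * (reward(s, a, s2) + gamma * V[s2])
                   for s2, p in transition(s, a).items())
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}

Alternating these two functions until the improved policy stops changing is exactly the policy iteration scheme described next.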
Policy / Value iteration

* Policy iteration alternates the two steps:

  π_0 -> V^{π_0} -> π_1 -> V^{π_1} -> ... -> π* -> V^{π*}

  This involves two nested iterations and can be slow; we do not need to converge all the way to V^π at each evaluation step, just move towards it.
* Value iteration:

  V_{k+1}(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V_k(s') ]

  It uses the Bellman optimality equation as an update and converges to V*.

Monte-Carlo Policy Evaluation

There are two different approaches to evaluating a policy using Monte-Carlo: First-Visit and Every-Visit. For both, we keep a counter N(s) for each state s, and we save the sum of the returns from each episode for each state in S(s). N(s) and S(s) accumulate values until all K episodes are completed; they are not reset to zero after each episode. Let us have a closer look at both approaches.

First-Visit Monte-Carlo Policy Evaluation

We do the same as for Every-Visit Monte-Carlo Policy Evaluation. The only difference is that, within the same episode, we update N(s) and S(s) only on the first visit of the state s. In the algorithm, we check whether s is visited for the first time in the episode and only then update them.

  Initialize:
    π ← policy to be evaluated
    V ← an arbitrary state-value function
    S(s) ← 0, for all s ∈ S
    N(s) ← 0, for all s ∈ S
  Repeat for K episodes:
    (a) generate an episode using π
    (b) for each state s in the episode:
          if (first visit of s in this episode) {
            N(s) ← N(s) + 1
            S(s) ← S(s) + G_t
          }
    (c) V(s) ← S(s) / N(s), for all s ∈ S

Every-Visit Monte-Carlo Policy Evaluation

We iterate through K episodes. In each episode, every time we reach a state s we increment the counter N(s) for that state by one. It is possible that the same state is reached multiple times in the same episode; for all those visits the counter is increased:

  N(s) ← N(s) + 1

On all those visits we also update the current S(s) value by adding the return G_t of the current episode k, starting from the state s to the terminal state. For each visit, G_t can be different:

  S(s) ← S(s) + G_t

Having done that, we update our current value estimate for the state:

  V(s) ← S(s) / N(s)

which is just the average of all the returns over all episodes so far. The algorithm looks like the following:

  Initialize:
    π ← policy to be evaluated
    V ← an arbitrary state-value function
    S(s) ← 0, for all s ∈ S
    N(s) ← 0, for all s ∈ S
  Repeat for K episodes:
    (a) generate an episode using π
    (b) for each state s in the episode:
          N(s) ← N(s) + 1
          S(s) ← S(s) + G_t
    (c) V(s) ← S(s) / N(s), for all s ∈ S
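Both variants can share one implementation. Below is a minimal sketch in which a first_visit flag switches between First-Visit and Every-Visit evaluation; generate_episode(policy) is an assumed helper that runs one episode under π and returns the visited states paired with the rewards received, and gamma is the discount used when accumulating the returns G_t. All names are illustrative.

# Minimal sketch covering both First-Visit and Every-Visit Monte-Carlo policy evaluation.
# `generate_episode(policy)` is an assumed helper returning [(s_0, r_1), (s_1, r_2), ...],
# i.e. each visited state paired with the reward received on leaving it.
from collections import defaultdict

def mc_policy_evaluation(policy, generate_episode, gamma=1.0, K=1000, first_visit=True):
    N = defaultdict(int)     # N(s): visits accumulated over all K episodes
    S = defaultdict(float)   # S(s): sum of returns G_t observed from state s
    for _ in range(K):
        episode = generate_episode(policy)
        # Compute the return G_t for every time step, working backwards from the end.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G_t in returns:
            if first_visit and state in seen:
                continue              # First-Visit: only the first occurrence per episode counts
            seen.add(state)
            N[state] += 1
            S[state] += G_t
    return {s: S[s] / N[s] for s in N}   # V(s) = S(s) / N(s)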
Q-Learning in Python

Pre-requisite: Reinforcement Learning. Reinforcement Learning, briefly, is a learning paradigm in which a learning agent learns, over time, to behave optimally in a certain environment by interacting with that environment continuously. During its course of learning, the agent experiences various situations in the environment; these are called states. While in a given state, the agent may choose from a set of allowable actions, which may fetch different rewards (or penalties). Over time, the agent learns to maximize these rewards so as to behave optimally in whatever state it is in. Q-Learning is a basic form of Reinforcement Learning which uses Q-values (also called action values) to iteratively improve the behaviour of the learning agent.

1. Q-Values or Action-Values: Q-values are defined for states and actions. Q(S, A) is an estimate of how good it is to take the action A at the state S. This estimate of Q(S, A) is iteratively computed using the TD-Update rule described below.

2. Rewards and Episodes: An agent, over the course of its lifetime, starts from a start state and makes a number of transitions from its current state to a next state, based on its choice of action and on the environment it is interacting with. At every transition, the agent takes an action, observes a reward from the environment, and then transits to another state. If at any point in time the agent ends up in one of the terminating states, no further transitions are possible. This is said to be the completion of an episode.

3. Temporal Difference or TD-Update: the Temporal Difference or TD-Update rule can be represented as follows:

  Q(S, A) ← Q(S, A) + α ( R + γ Q(S', A') − Q(S, A) )

This update rule to estimate the value of Q is applied at every time step of the agent's interaction with the environment. The terms used are explained below:
* S: current state of the agent.
* A: current action, picked according to some policy.
* S': next state where the agent ends up.
* A': next best action, picked using the current Q-value estimate, i.e. the action with the maximum Q-value in the next state.
* R: current reward observed from the environment in response to the current action.
* γ (0 <= γ <= 1): discounting factor for future rewards. Future rewards are less valuable than current rewards, so they must be discounted. Since the Q-value is an estimate of expected rewards from a state, the discounting rule applies here as well.
* α: step length taken to update the estimate of Q(S, A).

4. Choosing the action to take using an ε-greedy policy: the ε-greedy policy is a very simple policy for choosing actions using the current Q-value estimates. It goes as follows:
* With probability (1 − ε), choose the action which has the highest Q-value.
* With probability ε, choose any action at random.

Q-Learning:
* Q-learning is an off-policy RL algorithm used for temporal difference learning. Temporal difference learning methods are ways of comparing temporally successive predictions.
* It learns the value function Q(S, A), which tells us how good it is to take action A at a particular state S.
* Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation.
* The main objective of Q-learning is to learn a policy which can inform the agent what actions should be taken to maximize the reward under what circumstances.
* It is an off-policy RL algorithm that attempts to find the best action to take at the current state.
* The goal of the agent in Q-learning is to maximize the value of Q, which can be derived from the Bellman equation. Consider the Bellman equation given below:

  V(s) = max_a [ R(s, a) + γ Σ_{s'} P(s, a, s') V(s') ]

State Action Reward State Action (SARSA):
* SARSA stands for State Action Reward State Action; it is an on-policy temporal difference learning method. An on-policy control method selects the action for each state while learning, using a specific policy.
* The goal of SARSA is to calculate Q^π(s, a) for the selected current policy π and all pairs (s, a).
* The main difference between the Q-learning and SARSA algorithms is that, unlike Q-learning, the maximum reward for the next state is not required for updating the Q-value in the table.
* In SARSA, the new action and reward are selected using the same policy which determined the original action.
* SARSA is so named because it uses the quintuple Q(s, a, r, s', a'), where:
  + s: original state
  + a: original action
  + r: reward observed while following the states
  + s', a': new state-action pair
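Putting the TD-Update rule and the ε-greedy policy together, here is a minimal tabular Q-learning sketch. The environment interface (reset() returning a state, step(action) returning next state, reward and a done flag) and all names are illustrative assumptions rather than a specific library API. Replacing the max over next actions with the Q-value of the action actually selected by the ε-greedy policy in the next state would turn this update into SARSA.

# Minimal sketch of tabular Q-learning with an epsilon-greedy behaviour policy.
# `env` is an assumed environment: reset() -> state, step(a) -> (next_state, reward, done).
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                      # Q(S, A), initialised to 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            # TD-Update: Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_A' Q(S',A') - Q(S,A))
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q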
[Diagram: the Dyna-style loop relating value/policy, acting, experience, direct RL, model learning, the model, and planning.]

Model-based Reinforcement Learning refers to learning optimal behaviour indirectly, by learning a model of the environment from taking actions and observing the outcomes, which include the next state and the immediate reward. The model predicts the outcomes of actions and is used in lieu of, or in addition to, interaction with the environment to learn optimal policies.

Model-based techniques

Below, model-based algorithms are grouped into four categories to highlight the range of uses of predictive models. For the comparative performance of some of these approaches in a continuous control setting, this benchmarking paper is highly recommended.

Analytic gradient computation

Assumptions about the form of the dynamics and cost function are convenient because they can yield closed-form solutions for locally optimal control, as in the LQR framework. Even when these assumptions are not valid, receding-horizon control can account for small errors introduced by approximated dynamics. Similarly, dynamics models parametrized as Gaussian processes have analytic gradients that can be used for policy improvement. Controllers derived via these simple parametrizations can also be used to provide guiding samples for training more complex nonlinear policies.

Sampling-based planning

In the fully general case of nonlinear dynamics models, we lose guarantees of local optimality and must resort to sampling action sequences. The simplest version of this approach, random shooting, entails sampling candidate actions from a fixed distribution, evaluating them under a model, and choosing the action that is deemed the most promising. More sophisticated variants iteratively adjust the sampling distribution, as in the cross-entropy method (CEM, used in PlaNet and PETS, among others).

Model-based data generation

An important detail in many machine learning success stories is a means of artificially increasing the size of a training set. It is difficult to define a manual data-augmentation procedure for policy optimization, but we can view a predictive model analogously, as a learned method of generating synthetic data. The original proposal of such a combination comes from the Dyna algorithm by Sutton, which alternates between model learning, data generation under the model, and policy learning using the model data. This strategy has been combined with iLQG, model ensembles, and meta-learning; it has been scaled to image observations; and it is amenable to theoretical analysis. A close cousin of model-based data generation is the use of a model to improve target value estimates for temporal difference learning.

Value-equivalence prediction

A final technique, which does not fit neatly into the model-based versus model-free categorization, is to incorporate computation that resembles model-based planning without supervising the model's predictions to resemble actual states. Instead, plans under the model are constrained to match trajectories in the real environment only in their predicted cumulative reward. These value-equivalent models have been shown to be effective in high-dimensional observation spaces where conventional model-based planning has proven difficult.
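Returning to the Dyna algorithm mentioned under model-based data generation: the sketch below shows a minimal tabular Dyna-Q version of that loop, in which real experience drives both a direct Q-learning update and the learning of a one-step model, and the learned model then generates simulated transitions for additional planning updates. The environment interface and all names are illustrative assumptions, not the exact algorithm from any of the works cited above.

# Minimal tabular Dyna-Q sketch: direct RL from real experience plus planning
# updates from a learned one-step model. `env` (reset() -> state,
# step(a) -> (next_state, reward, done)) and `actions` are assumed inputs.
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=500, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)       # Q(S, A) table
    model = {}                   # learned model: (S, A) -> (R, S') from observed transitions

    def greedy_or_random(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def td_update(s, a, r, s2):
        best_next = max(Q[(s2, b)] for b in actions)   # 0 for unvisited/terminal states
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = greedy_or_random(state)
            next_state, reward, done = env.step(action)
            td_update(state, action, reward, next_state)        # direct RL
            model[(state, action)] = (reward, next_state)       # model learning
            for _ in range(planning_steps):                     # planning on simulated data
                s, a = random.choice(list(model))
                r, s2 = model[(s, a)]
                td_update(s, a, r, s2)
            state = next_state
    return Q

With planning_steps set to 0 this reduces to plain Q-learning; increasing it trades extra computation per real step for faster learning from the same amount of environment interaction, which is the central appeal of model-based data generation.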
