This document introduces the deep reinforcement learning model 'A3C' (slides originally in Japanese).
The original paper is "Asynchronous Methods for Deep Reinforcement Learning" by V. Mnih et al.
2. The paper covered this time
[1] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1928–1937, 2016.
By making the method asynchronous, the replay memory is eliminated, and learning that is both faster and more accurate than DQN is achieved!
4. Reinforcement learning basics (1)
Loss function for 1-step Q-learning:
$L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$

Loss function for 1-step Sarsa:
$L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$

Loss function for n-step Q-learning:
$L_i(\theta_i) = \mathbb{E}\left[\left(\sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$

Gradient of the objective function in actor-critic:
$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi(a_t \mid s_t; \theta)\left(R_t - V^\pi(s_t)\right)\right]$

Notation:
$r$: reward
$\gamma$: discount factor
$Q(s, a; \theta_i)$: action-value function for taking action $a$ in state $s$
$V^\pi(s_t)$: value function of state $s$
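To make these targets concrete, here is a minimal numeric sketch (not from the slides): it plugs made-up rewards and a hypothetical table of next-state Q-values into the 1-step target, the n-step target, and the actor-critic advantage term.

```python
# Minimal numeric sketch of the targets above, using a hypothetical tabular
# Q/V and made-up numbers purely for illustration.
import numpy as np

gamma = 0.99
r = np.array([1.0, 0.0, 0.5])           # rewards r_t, r_{t+1}, r_{t+2} (made up)
Q_next_old = np.array([0.2, 0.8, 0.5])  # Q(s', a'; theta_{i-1}) for each a' (made up)
Q_sa = 0.6                              # current estimate Q(s, a; theta_i)

# 1-step Q-learning: the target bootstraps with max over a' under the older
# parameters theta_{i-1}.
target_1step = r[0] + gamma * Q_next_old.max()
loss_1step = (target_1step - Q_sa) ** 2

# n-step Q-learning (n = 3): discounted sum of n rewards plus a bootstrapped max-Q.
n = 3
target_nstep = sum(gamma**k * r[k] for k in range(n)) + gamma**n * Q_next_old.max()
loss_nstep = (target_nstep - Q_sa) ** 2

# Actor-critic: the policy gradient scales grad log pi(a_t|s_t) by the
# advantage R_t - V(s_t); here the n-step target stands in for the return R_t.
R_t, V_st = target_nstep, 0.9
advantage = R_t - V_st

print(loss_1step, loss_nstep, advantage)
```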
5. Reinforcement learning basics (2)
Loss function for 1-step Q-learning:
$L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$

In the case of DQN, this becomes the DQN loss function:
$L(\theta) = \mathbb{E}_{s, a, r, s' \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^2\right]$

$D$: experience replay memory
$\theta^{-}$: target network parameters
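As a rough illustration of this loss, the sketch below assumes a toy linear Q-function, a list-based replay memory D, and a frozen copy θ⁻ used for the target; these stand-ins are assumptions for illustration, not the DQN implementation itself.

```python
# Sketch of the DQN loss over a sampled minibatch, with an assumed linear
# Q-function and a toy replay memory D (all shapes and data are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma = 4, 2, 0.99

theta = rng.normal(size=(n_actions, n_features))   # online Q-network parameters
theta_minus = theta.copy()                         # target network, refreshed only periodically

def q(params, s):                                  # Q(s, .; params) for a linear model
    return params @ s

# Replay memory D: (s, a, r, s') transitions collected by interacting with the environment.
D = [(rng.normal(size=n_features), int(rng.integers(n_actions)),
      float(rng.normal()), rng.normal(size=n_features)) for _ in range(100)]

batch = [D[int(i)] for i in rng.integers(len(D), size=32)]  # i.i.d. minibatch sampled from D
loss = 0.0
for s, a, r, s_next in batch:
    y = r + gamma * q(theta_minus, s_next).max()   # target computed with theta^-
    loss += (y - q(theta, s)[a]) ** 2
loss /= len(batch)
print(loss)
```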
12. How Gorila works
A. Nair, et al. "Massively parallel methods for deep reinforcement learning." In ICML Deep Learning Workshop, 2015.
13. How Gorila works, ver. 1: a shared replay memory
[Diagram: N Actor-Learner sets. The Actor side runs an Environment with a Q Network; the Learner side holds a Q Network and a Target Q Network and computes the DQN Loss. Parameters live on a Parameter Server split into Shard 1, Shard 2, ..., Shard K, and all sets read from and write to a single Replay Memory.]
One Actor computer and one Learner computer form one set, and there are N sets in total. A single replay memory is shared by all sets.
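The ver. 1 data flow can be sketched in a single process as below; the ParameterServer, actor_step, and learner_step names are illustrative stand-ins (one shard, one Actor-Learner set, a fake environment), whereas Gorila itself runs N sets on separate machines against a sharded server.

```python
# Rough single-process sketch of the Gorila ver. 1 data flow (shared replay memory).
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
n_features, n_actions, gamma, lr = 4, 2, 0.99, 0.01

class ParameterServer:                       # holds theta (sharded in Gorila)
    def __init__(self):
        self.theta = rng.normal(size=(n_actions, n_features))
    def apply_gradient(self, grad):
        self.theta -= lr * grad
    def get(self):
        return self.theta.copy()

server = ParameterServer()
shared_replay = deque(maxlen=10_000)         # the single replay memory shared by all sets

def actor_step():
    """Actor: copy theta from the server, act, and store the transition."""
    theta = server.get()
    s = rng.normal(size=n_features)
    a = int((theta @ s).argmax())            # greedy action, for illustration only
    r, s_next = float(rng.normal()), rng.normal(size=n_features)   # fake environment
    shared_replay.append((s, a, r, s_next))

def learner_step(theta_minus):
    """Learner: sample the shared memory, compute the DQN gradient, send it to the server."""
    s, a, r, s_next = shared_replay[int(rng.integers(len(shared_replay)))]
    theta = server.get()
    td = r + gamma * (theta_minus @ s_next).max() - (theta @ s)[a]
    grad = np.zeros_like(theta)
    grad[a] = -2.0 * td * s                  # d/d_theta of (y - Q(s, a; theta))^2
    server.apply_gradient(grad)

theta_minus = server.get()                   # target network, refreshed periodically
for t in range(100):
    actor_step()
    learner_step(theta_minus)
    if t % 20 == 0:
        theta_minus = server.get()
print(server.theta)
```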
14. How Gorila works, ver. 2 (bundled mode): individual replay memories
[Diagram: the same architecture as slide 13, except that each Actor-Learner set has its own Replay Memory.]
One Actor computer and one Learner computer form one set, and there are N sets in total. A replay memory is placed on each set's computer.
15. Changes from Gorila (bundled mode) to asynchronous DQN (1)
[Diagram: the bundled-mode architecture (Environment, Q Network, Target Q Network, DQN Loss, Parameter Server with Shard 1, ..., Shard K) with the Replay Memory removed.]
Each Actor-Learner set now corresponds to a single thread on a CPU, and the replay memory is eliminated.
16. Changes from Gorila (bundled mode) to asynchronous DQN (2)
[Diagram: each thread now feeds the DQN Loss into accumulated gradients instead of a replay memory.]
Instead of a replay memory, each thread accumulates gradients.
17. Changes from Gorila (bundled mode) to asynchronous DQN (3)
[Diagram: two sharded parameter servers, a Parameter Server for the Q-Network and a Parameter Server for the Target Q-Network; each thread computes the DQN Loss and sends gradients.]
A separate parameter server is created for the Target Q-Network.
18. Flow of asynchronous DQN (1)
[Diagram: each Actor-Learner thread copies parameters from the two parameter servers.]
Copy $\theta$ from the Parameter Server for the Q-Network and $\theta^{-}$ from the Parameter Server for the Target Q-Network.
19. Flow of asynchronous DQN (2)
[Diagram: the Actor side interacts with the Environment.]
Take action a in state s, and observe s' and r.
20. Flow of asynchronous DQN (3)
[Diagram: the Learner side computes the DQN Loss from the observed transition.]
Compute the loss:
$L(\theta) = \mathbb{E}_{s, a, r, s'}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^2\right]$
21. Flow of asynchronous DQN (4)
[Diagram: the Learner side accumulates the gradients of the loss.]
Accumulate the gradients: $d\theta \leftarrow d\theta + \frac{\partial L(\theta)}{\partial \theta}$
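The accumulation step can be written out for the toy linear Q-function used in the earlier sketches; the analytic gradient of the squared TD error and all sampled numbers below are purely illustrative.

```python
# Sketch of d_theta <- d_theta + dL(theta)/d_theta for an assumed linear Q-function.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma = 4, 2, 0.99
theta = rng.normal(size=(n_actions, n_features))
theta_minus = theta.copy()

d_theta = np.zeros_like(theta)              # per-thread gradient accumulator
for _ in range(5):                          # a few steps before the periodic update
    s, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
    a, r = int(rng.integers(n_actions)), float(rng.normal())
    td = r + gamma * (theta_minus @ s_next).max() - (theta @ s)[a]
    grad = np.zeros_like(theta)
    grad[a] = -2.0 * td * s                 # dL/d_theta of the squared TD error
    d_theta += grad                         # d_theta <- d_theta + dL/d_theta
print(d_theta)
```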
22. Flow of asynchronous DQN (5)
[Diagram: the accumulated gradients are sent to the Parameter Server for the Q-Network.]
Periodically send the accumulated gradient $d\theta$ to the parameter server and update the parameters.
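Putting slides 18 to 22 together, here is a minimal threaded sketch of the per-thread loop (copy θ and θ⁻, act, compute the loss, accumulate dθ, periodically send it); the thread count, the update intervals i_async and i_target, and the toy linear Q-function with a random environment are assumptions for illustration, not the paper's settings.

```python
# Threaded sketch of asynchronous one-step Q-learning (illustrative only).
import threading
import numpy as np

n_features, n_actions, gamma, lr = 4, 2, 0.99, 0.01
shared_theta = np.random.default_rng(0).normal(size=(n_actions, n_features))
shared_theta_minus = shared_theta.copy()
lock = threading.Lock()

def worker(seed, steps=200, i_async=5, i_target=40):
    global shared_theta, shared_theta_minus
    rng = np.random.default_rng(seed)
    d_theta = np.zeros_like(shared_theta)
    for t in range(1, steps + 1):
        with lock:                                    # copy theta and theta^- (slide 18)
            theta, theta_minus = shared_theta.copy(), shared_theta_minus.copy()
        s = rng.normal(size=n_features)               # act and observe r, s' (slide 19)
        a = int(rng.integers(n_actions)) if rng.random() < 0.1 else int((theta @ s).argmax())
        r, s_next = float(rng.normal()), rng.normal(size=n_features)
        td = r + gamma * (theta_minus @ s_next).max() - (theta @ s)[a]   # loss term (slide 20)
        grad = np.zeros_like(theta)
        grad[a] = -2.0 * td * s
        d_theta += grad                               # accumulate gradients (slide 21)
        if t % i_async == 0:                          # periodically send d_theta (slide 22)
            with lock:
                shared_theta -= lr * d_theta
            d_theta[:] = 0.0
        if t % i_target == 0:                         # refresh the target network
            with lock:
                shared_theta_minus = shared_theta.copy()

threads = [threading.Thread(target=worker, args=(seed,)) for seed in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(shared_theta)
```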