Deep Deformable Q-Network: An Extension of Deep Q-Network
ABSTRACT
The performance of Deep Reinforcement Learning (DRL) algorithms is usually constrained by instability and variability. In this work, we present an extension of Deep Q-Network (DQN) called Deep Deformable Q-Network, which is based on deformable convolution mechanisms. The new algorithm can readily be built on existing models and can be easily trained end-to-end by standard backpropagation. Extensive experiments on the Atari games validate the feasibility and effectiveness of the proposed Deep Deformable Q-Network.

CCS CONCEPTS
• Computing methodologies → Supervised learning by classification;

KEYWORDS
Deep Q-Network, deformable convolution layer, Deep Learning, Reinforcement Learning

ACM Reference format:
Beibei Jin, Jianing Yang, Xiangsheng Huang, and Dawar Khan. 2017. Deep Deformable Q-Network: An Extension of Deep Q-Network. In Proceedings of WI '17, Leipzig, Germany, August 23-26, 2017, 4 pages.
https://doi.org/10.1145/3106426.3109426

1 INTRODUCTION
Reinforcement Learning (RL) learns how to map states to actions so as to obtain the maximum numerical reward signal. In Reinforcement Learning, an agent seeks an optimal policy for a sequential decision making problem [18]. To the best of our knowledge, delayed reward and trial-and-error search are the two most important and distinguishing features of reinforcement learning. Reinforcement Learning has wide application prospects for solving complex control and decision making problems.

During the development of Reinforcement Learning, many algorithms including Q-learning [19], SARSA [18, 20], and policy gradient methods [21] have been introduced to solve RL problems. However, most traditional methods assume that the state space and the action space are discrete, which is inconsistent with the reality that real-world problems are usually high-dimensional. Moreover, these methods rely on manually extracted features to represent the environment, which limits the flexibility of the agent. Recently, deep learning has made great progress in various fields [22–24] because neural networks have efficient generalization ability and powerful abstraction ability. Mnih et al. [1] first put forward deep reinforcement learning and the Deep Q-Network (DQN) by successfully integrating Deep Learning and Reinforcement Learning. DQN presented a remarkably flexible and stable algorithm, showing great success in the majority of games within the Arcade Learning Environment (ALE) [25]. The success of DQN inspires researchers to seek improvements in order to further its learning abilities. In [2], Mnih et al. introduced a mechanism to improve the deep Q-network, which learns successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. Similarly, another extension of DQN is presented in [26], which works on "soft" and "hard" attention mechanisms. Another method is Double DQN, which uses the existing architecture and deep neural network of the DQN to find better policies [6]. Schaul et al. developed a framework [27] for prioritizing experience. This framework replays the important transitions more frequently and therefore learns more efficiently. In [7], Wang et al. presented two separate estimators: one for the state value function and one for the state-dependent action advantage function, which generalizes the learning across actions without imposing any change to the underlying reinforcement learning algorithm.

However, the CNNs used in the Deep Q-Network usually have a fixed shape of receptive field, which is undesirable for high-level layers that encode the semantics over spatial locations and limits them in modeling large, unknown transformations.

In this work, we propose a reinforcement learning method called Deep Deformable Q-Network, based on a deformable convolutional neural network obtained by improving the structure of the neural network in the Deep Q-Network. Different from the conventional Deep Q-Network, the CNNs used in the Deep Deformable Q-Network have a new kind of convolution unit with more diverse forms of receptive fields rather than a fixed shape. Moreover, the shape of the receptive fields can be learned during the training procedure.

The rest of the paper is organized as follows. In Section 2, the specific algorithms are described. Section 3 presents the structure of the network and analyzes the results of the experiments. Conclusions are formulated in Section 4.
2 DEEP DEFORMABLE Q-NETWORK
In Reinforcement Learning (RL), an agent is faced with a sequential decision making problem [1, 2], where interaction with the environment takes place at discrete time steps (t = 0, 1, ...).
Figure 1: The overall network structure diagram. The box above is the illustration of 3×3 deformable convolution. The box at the bottom is the illustration of the network used in this paper.
At time t, the agent observes a state s_t ∈ S and selects an action a_t ∈ A, which results in a scalar reward r_t ∈ R and a transition to the next state s_{t+1} ∈ S. We consider infinite-horizon problems with the discounted cumulative reward objective
$$R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'},$$
where γ ∈ [0, 1] is the discount factor. The goal of the agent is to find an optimal policy that maximizes its expected discounted cumulative reward.
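For concreteness, the following minimal Python sketch computes this return for a single finite episode; the `discounted_return` helper and its example rewards are illustrative, not part of the original implementation.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for t = 0,
    given the reward sequence r_0, ..., r_T of one episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example: three consecutive rewards of 1.0 give 1 + 0.99 + 0.99^2 = 2.9701
print(discounted_return([1.0, 1.0, 1.0]))
```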
Figure 2: The first plot shows the average maximum predicted action-value on a held-out set of states on Breakout. The third plot shows the average reward per episode on Breakout during training. The statistics were computed by running an ε-greedy policy.
2.1 Deep Q-Learning
We consider the usual Deep Q-Network [1]. A Deep Q-Network is a multi-layered neural network that, for a given state s, outputs a vector of action values Q(s, a; θ), where θ denotes the parameters of the neural network. For an n-dimensional state space and an action space of m actions, the Deep Q-Network is a function from R^n to R^m. The loss function of the Deep Q-Network is as follows [1]:
$$L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\big(y_i - Q(s, a; \theta_i)\big)^2\right], \qquad (1)$$
where
$$y_i = r_i + \gamma \max_{a'} Q(s', a'; \theta^-). \qquad (2)$$
L_i represents the expected error when the parameters are θ_i, θ^- represents the parameters of a separate target network, and θ_i represents the parameters of the online network. Using the target network improves the stability of the learning updates. The gradient of the loss is as follows [1]:
$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\big(y_i - Q(s, a; \theta_i)\big)\,\nabla_{\theta_i} Q(s, a; \theta_i)\right]. \qquad (3)$$
In order to avoid correlated updates, experience replay with a fixed maximum capacity is introduced to the Deep Q-Network. Transitions from previous episodes are sampled from the replay memory multiple times to update the network, so the divergence issues caused by correlated updates are avoided.
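As a concrete illustration of Eqs. (1)-(2), here is a minimal NumPy sketch of how the targets and the minibatch loss could be computed. The names `online_q` and `target_q` are assumed callables returning batches of Q-values under θ_i and θ^- respectively, and the terminal-state masking via `dones` is an implementation detail not spelled out in the text.

```python
import numpy as np

def dqn_targets(rewards, next_states, dones, target_q, gamma=0.99):
    """y_i = r_i + gamma * max_a' Q(s', a'; theta^-) from Eq. (2).
    `rewards` and `dones` are 1-D arrays; terminal transitions
    (dones == 1) keep only the immediate reward."""
    next_q = target_q(next_states)                  # shape (batch, num_actions)
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

def dqn_loss(online_q, states, actions, targets):
    """Squared error of Eq. (1), averaged over a sampled minibatch."""
    q_values = online_q(states)                               # (batch, num_actions)
    q_taken = q_values[np.arange(len(actions)), actions]      # Q(s, a; theta_i)
    return np.mean((targets - q_taken) ** 2)
```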
2.2 Deformable Convolution
Generally, a convolution unit has a number of learnable filters, and each filter is convolved with its receptive field. For 2D convolution, the grid R defines the receptive field size and dilation [9]. For example,
$$R = \{(-1, -1), (-1, 0), \ldots, (0, 1), (1, 1)\}$$
defines a 3 × 3 kernel with dilation 1. For each location p_0 on the output feature map y, we have
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n). \qquad (4)$$
Deformable convolution [9] augments the regular grid R with offsets {Δp_n | n = 1, ..., N}, where N = |R|, and Eq. (4) becomes
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n). \qquad (5)$$
The sampling is now performed over the irregular, offset locations p_n + Δp_n. Since the offset Δp_n is typically fractional, Eq. (5) is implemented through bilinear interpolation as
$$x(p) = \sum_{q} G(q, p) \cdot x(q), \qquad (6)$$
where p denotes an arbitrary location (p = p_0 + p_n + Δp_n for Eq. (5)), q enumerates all integral spatial locations in the feature map x, and G is the bilinear interpolation kernel.
The offsets are obtained by applying a convolutional layer over the same input feature map, producing an offset field of the same spatial resolution. The channel dimension of this field is 2N, encoding N 2D offset vectors.
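The following Python sketch illustrates Eqs. (5)-(6) at a single output location. The 3×3 grid, the `bilinear_sample` helper, and the per-location offset list are illustrative names for the mechanism, not the paper's actual layer implementation.

```python
import numpy as np

def bilinear_sample(x, p):
    """Evaluate x(p) at a fractional location p = (py, px) via Eq. (6):
    x(p) = sum_q G(q, p) * x(q), where G is the bilinear kernel."""
    h, w = x.shape
    py, px = p
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                g = max(0.0, 1 - abs(py - qy)) * max(0.0, 1 - abs(px - qx))
                val += g * x[qy, qx]
    return val

def deformable_conv_at(x, weights, offsets, p0):
    """One output location of Eq. (5): sum_n w(p_n) * x(p0 + p_n + dp_n).
    `weights` holds w(p_n) for the 3x3 grid R; `offsets` holds the learned
    2D offsets dp_n predicted by a separate convolutional layer."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # the grid R
    out = 0.0
    for n, pn in enumerate(grid):
        p = (p0[0] + pn[0] + offsets[n][0], p0[1] + pn[1] + offsets[n][1])
        out += weights[n] * bilinear_sample(x, p)
    return out
```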
3 EXPERIMENT
The proposed algorithm was tested on Breakout, a popular Atari 2600 game. The project was implemented in Keras with a TensorFlow backend.

3.1 Network Architecture
The structure of the network in this work is based on the network in [1]. We adjust the network in [1] and combine deformable convolution layers into it. The overall network structure is illustrated in Figure 1, and Table 1 details the parameters of each layer of the network. The final input representation to the neural network is an 84×84×4 image stacked from 4 frames. The first hidden layer convolves 32 8×8 filters with stride 1 over the input image and applies a subsample of 4×4, followed by a rectifier nonlinearity. The second hidden layer convolves 64 4×4 filters with stride 1, followed by a subsample of 2×2 and a rectifier nonlinearity. The third hidden layer convolves 64 3×3 filters with stride 1, again followed by rectifier units. The final hidden layer is fully connected and consists of 512 rectifier units. The output layer is a fully connected linear layer with a single output for each valid action. A deformable convolution layer can be placed before any convolution layer; the number of filters and the shape of the output of the deformable convolution layer must be consistent with the layer before it. Table 1 is an example with two deformable convolution layers placed before the convolution hidden layers.
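A minimal Keras sketch of this stack is shown below. It assumes a custom `DeformableConv2D` layer class (not part of stock Keras), interprets the 4×4 and 2×2 "subsamples" as strides, and picks one possible placement and filter count for the deformable layers; the authoritative configuration is the paper's Table 1.

```python
from keras.layers import Input, Conv2D, Flatten, Dense
from keras.models import Model

def build_q_network(num_actions, DeformableConv2D=None):
    """Sketch of the Section 3.1 architecture under stated assumptions."""
    frames = Input(shape=(84, 84, 4))            # four stacked 84x84 frames
    x = frames
    if DeformableConv2D is not None:
        # Hypothetical custom layer; filters/output shape should follow Table 1.
        x = DeformableConv2D(filters=4, kernel_size=(3, 3))(x)
    x = Conv2D(32, (8, 8), strides=(4, 4), activation='relu')(x)
    if DeformableConv2D is not None:
        x = DeformableConv2D(filters=32, kernel_size=(3, 3))(x)
    x = Conv2D(64, (4, 4), strides=(2, 2), activation='relu')(x)
    x = Conv2D(64, (3, 3), strides=(1, 1), activation='relu')(x)
    x = Flatten()(x)
    x = Dense(512, activation='relu')(x)
    q_values = Dense(num_actions, activation='linear')(x)  # one Q-value per action
    return Model(inputs=frames, outputs=q_values)
```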
3.2 Hyper-parameters
In our experiments, the discount factor γ was set to 0.99 and the learning rate α was set to 0.00025. The number of steps between target network updates was 10,000. Training was done over 12,000 episodes. The agent was evaluated after every 10,000 steps based on the average reward per episode obtained by running an ε-greedy policy with ε annealed linearly from 1 to 0.1 over the first million frames and fixed at 0.1 thereafter. The size of the experience replay memory was 400,000 tuples. The memory was sampled to update the network every 4 steps with minibatches of size 32. The model was trained using backpropagation through time.
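The exploration schedule described above can be written as a small helper; the function name and the settings dictionary below are just an illustrative summary of the listed values, not code from the original project.

```python
def epsilon_at(frame, eps_start=1.0, eps_end=0.1, anneal_frames=1000000):
    """Linearly anneal epsilon from 1.0 to 0.1 over the first million frames,
    then keep it fixed at 0.1 (Section 3.2)."""
    if frame >= anneal_frames:
        return eps_end
    return eps_start + (eps_end - eps_start) * frame / anneal_frames

HYPERPARAMS = {
    'gamma': 0.99,               # discount factor
    'learning_rate': 0.00025,
    'target_update_steps': 10000,
    'replay_capacity': 400000,   # experience replay tuples
    'update_every': 4,           # environment steps between gradient updates
    'batch_size': 32,
}
```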
3.3 Training and Results
Unlike in supervised learning, accurately evaluating the progress of an agent in reinforcement learning can be challenging. One metric is the total reward the agent obtains in an episode, averaged over a number of games, which we compute periodically during training. The first plot in Figure 2 shows how the average total reward evolves during training on the game Breakout. The average total reward metric tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits [1]. The estimated action-value function Q, which estimates how much discounted reward the agent can obtain, is more stable. We observe relatively smooth improvement in the predicted Q during training and did not experience any divergence issues.
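The smoother metric, the average maximum predicted action-value over a fixed held-out set of states (Figure 2), could be computed as in the sketch below; `online_q` and `holdout_states` are assumed placeholders rather than names from the original code.

```python
import numpy as np

def average_max_q(online_q, holdout_states):
    """Average of max_a Q(s, a) over a fixed held-out set of states,
    the smoother progress metric plotted in Figure 2."""
    q_values = online_q(holdout_states)      # shape (num_states, num_actions)
    return float(np.mean(q_values.max(axis=1)))
```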
4 CONCLUSION AND FUTURE WORK
In this paper, we have presented one way of integrating deformable convolution mechanisms, which give a conventional convolution more freedom, into the structure of the Deep Q-Network. Through experiments, we showed the feasibility and effectiveness of our approach.
In future work, we may dynamically learn when and how to insert deformable convolution layers for the best results. Finally, incorporating deformable convolution layers into on-policy methods such as SARSA and Actor-Critic methods [25] may further improve these algorithms.
REFERENCES
[1] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and Riedmiller, M., 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
[2] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G. and Petersen, S., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), pp.529-533.
[3] Hausknecht, M. and Stone, P., 2015. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527.
[4] Osband, I., Blundell, C., Pritzel, A. and Van Roy, B., 2016. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems (pp. 4026-4034).
[5] Sutton, R.S. and Barto, A.G., 1998. Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT Press.
[6] Van Hasselt, H., Guez, A. and Silver, D., 2016. Deep reinforcement learning with double Q-learning. In AAAI (pp. 2094-2100).
[7] Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M. and de Freitas, N., 2015. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581.
[8] Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
[9] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H. and Wei, Y., 2017. Deformable convolutional networks. arXiv preprint arXiv:1703.06211.
[10] Jeon, Y. and Kim, J., 2017. Active convolution: Learning the shape of convolution for image classification. arXiv preprint arXiv:1703.09076.
[11] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. and Wojna, Z., 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2818-2826).
[12] Simonyan, K. and Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[13] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
[14] Wang, J., Zhou, J., Wonka, P. and Ye, J., 2013. Advances in neural information processing systems. In Neural Information Processing Systems Foundation.
[15] Sutton, R.S. and Barto, A.G., 1998. Introduction to reinforcement learning (Vol. 135). Cambridge: MIT Press.
[16] Stadie, B.C., Levine, S. and Abbeel, P., 2015. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814.
[17] LeCun, Y., Bengio, Y. and Hinton, G., 2015. Deep learning. Nature, 521(7553), pp.436-444.
[18] Sutton, R.S. and Barto, A.G., 1998. Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT Press.
[19] Watkins, C.J. and Dayan, P., 1992. Q-learning. Machine Learning, 8(3-4), pp.279-292.
[20] Rummery, G.A. and Niranjan, M., 1994. On-line Q-learning using connectionist systems. University of Cambridge, Department of Engineering.
[21] Sutton, R.S., McAllester, D.A., Singh, S.P. and Mansour, Y., 1999. Policy gradient methods for reinforcement learning with function approximation. In NIPS (Vol. 99, pp. 1057-1063).
[22] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
[23] Long, J., Shelhamer, E. and Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440).
[24] Girshick, R., Donahue, J., Darrell, T. and Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 580-587).
[25] Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. and Kavukcuoglu, K., 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (pp. 1928-1937).
[26] Sorokin, I., Seleznev, A., Pavlov, M., Fedorov, A. and Ignateva, A., 2015. Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693.
[27] Schaul, T., Quan, J., Antonoglou, I. and Silver, D., 2015. Prioritized experience replay. arXiv preprint arXiv:1511.05952.