
Assignment

On

MACHINE LEARNING AND COMPUTATIONAL
INTELLIGENCE (CS 802C)
Assignment Submitted in Partial
Fulfilment of the Requirements for the
Award of

Bachelor of Engineering
IN COMPUTER SCIENCE AND ENGINEERING

Submitted by

Shubham Goyat
(Roll No.: CO20354)

Under the supervision of Prof. Sunil K. Singh

CHANDIGARH COLLEGE OF ENGINEERING AND TECHNOLOGY


(DEGREE WING)
Government Institute under Chandigarh (UT) Administration, Affiliated to Panjab University,
Chandigarh

Sector-26, Chandigarh. PIN-160019


A Comprehensive Exploration of Reinforcement Learning
Algorithms and Applications

1. Abstract
Reinforcement learning (RL) has emerged as a powerful paradigm for training intelligent
agents to make sequential decisions in dynamic environments. This paper provides an in-depth
analysis of various reinforcement learning algorithms, their theoretical foundations, practical
implementations, and applications across diverse domains. Beginning with an introduction to
the fundamental concepts of reinforcement learning, this paper explores key methodologies
including value-based, policy-based, and model-based approaches. Moreover, it discusses
recent advancements such as deep reinforcement learning and multi-agent reinforcement
learning. Through a review of prominent RL algorithms and their comparative analysis, this
paper aims to provide insights into their strengths, weaknesses, and suitability for different
tasks. Furthermore, it examines real-world applications of reinforcement learning spanning
robotics, gaming, finance, healthcare, and more. The paper concludes with a discussion on
future directions and challenges in the field of reinforcement learning.

2. Introduction
Reinforcement learning (RL) is a branch of machine learning concerned with training agents
to learn optimal behaviors through interaction with an environment. Unlike supervised learning
where labeled data is provided, or unsupervised learning which deals with uncovering hidden
patterns, RL involves learning from feedback obtained through trial and error. This feedback,
often in the form of rewards or penalties, guides the agent towards making better decisions
over time.
In recent years, reinforcement learning has witnessed significant advancements, fuelled by
breakthroughs in deep learning and computational power. These advancements have enabled
RL algorithms to tackle complex problems and achieve remarkable performance in various
domains such as robotics, gaming, finance, healthcare, and more. Despite its successes, RL
faces challenges including sample inefficiency, exploration-exploitation trade-off, and
generalization to unseen environments.
This paper aims to provide a comprehensive overview of reinforcement learning, covering its
fundamental principles, key algorithms, theoretical underpinnings, practical implementations,
and applications. By examining the strengths and limitations of different RL approaches, this
paper seeks to shed light on the state-of-the-art techniques and future directions in this rapidly
evolving field.
Reinforcement learning, as a subfield of artificial intelligence, has garnered significant
attention due to its ability to enable agents to learn through interaction with their environments.
The essence of reinforcement learning lies in its resemblance to human learning processes –
trial and error, reward-seeking, and gradual improvement based on past experiences. This
characteristic makes reinforcement learning particularly well-suited for tasks where explicit
instruction or labeled data is scarce or impractical. Instead, agents learn to navigate complex
environments by exploring different actions and learning from the consequences, a process
reminiscent of how humans acquire skills and expertise.
The foundational concepts of reinforcement learning trace back to the mid-20th century, with
early developments in dynamic programming and optimal control theory. However, it wasn't
until the advent of computational reinforcement learning in the late 20th and early 21st
centuries, coupled with advancements in machine learning and artificial intelligence, that
reinforcement learning gained prominence as a powerful paradigm for autonomous decision-
making. Today, reinforcement learning finds applications in diverse fields ranging from
robotics and gaming to finance and healthcare, demonstrating its versatility and potential
impact on society.
One of the defining characteristics of reinforcement learning is its focus on sequential decision-
making in uncertain and dynamic environments. Unlike traditional machine learning
paradigms where data is assumed to be independent and identically distributed (i.i.d.),
reinforcement learning deals with temporal dependencies and non-stationarity. This temporal
aspect introduces unique challenges such as credit assignment, delayed rewards, and the
exploration-exploitation dilemma. Overcoming these challenges requires sophisticated
algorithms, theoretical frameworks, and computational techniques, driving ongoing research
efforts in the field.
In recent years, the intersection of reinforcement learning with deep learning has led to
significant breakthroughs, culminating in the emergence of deep reinforcement learning
(DRL). By leveraging deep neural networks to approximate value functions, policies, or
models, DRL algorithms have achieved unprecedented performance in tasks ranging from
playing complex video games to controlling robotic systems. Despite their successes, DRL
algorithms pose new challenges related to sample efficiency, stability, and generalization,
prompting researchers to explore avenues for improvement and extension.
In the following sections of this paper, we delve deeper into the principles, methodologies,
algorithms, applications, and challenges of reinforcement learning. By providing a
comprehensive overview and analysis of the state-of-the-art in reinforcement learning research,
we aim to offer insights into its potential, limitations, and future directions. Through a
combination of theoretical exposition, algorithmic exploration, and real-world case studies, this
paper seeks to contribute to the broader discourse on reinforcement learning and its role in
shaping the future of artificial intelligence and autonomous systems.

3. Proposed System Architecture


In the context of reinforcement learning applications, designing an effective system
architecture is crucial for achieving desirable performance, scalability, and robustness. The
architecture serves as the foundation upon which various components interact to enable the
learning agent to perceive its environment, take actions, and learn from feedback. In this
section, we propose a system architecture tailored for reinforcement learning tasks, drawing
upon principles from distributed systems, machine learning frameworks, and computational
infrastructure.

Environment Interface:
At the core of the system architecture lies the environment interface, which serves as the bridge
between the learning agent and the external world. This interface encapsulates the dynamics of
the environment, providing observations to the agent and receiving actions in return.
Depending on the nature of the task, the environment interface may range from simulated
environments in robotics to real-world systems in autonomous vehicles or automation.
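
As a concrete illustration of such an interface, the sketch below defines a hypothetical GridWorld environment that exposes only reset and step methods in the style of common RL toolkits; the grid size, action encoding, and reward values are illustrative assumptions.

```python
class GridWorld:
    """Minimal Gym-style environment interface: reset() -> state, step(a) -> (state, reward, done)."""

    def __init__(self, size=5):
        self.size = size            # size x size grid, goal in the bottom-right corner
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        # actions: 0 = up, 1 = down, 2 = left, 3 = right
        moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
        dr, dc = moves[action]
        row = min(max(self.state[0] + dr, 0), self.size - 1)
        col = min(max(self.state[1] + dc, 0), self.size - 1)
        self.state = (row, col)
        done = self.state == (self.size - 1, self.size - 1)
        reward = 1.0 if done else -0.01   # small step cost, positive reward at the goal
        return self.state, reward, done
```

Because the agent interacts with the environment only through reset and step, the same learning code can be reused whether the interface wraps a simulator or a real system.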
Agent:
The agent module encompasses the intelligence responsible for decision-making and learning.
It comprises several sub-components including perception, decision-making, and learning
algorithms. The perception module processes observations received from the environment,
extracting relevant features and representations. The decision-making module employs
reinforcement learning algorithms to select actions based on observed states and learned
policies. The learning module updates the agent's internal models and policies based on
feedback received from the environment.

Figure 1: Architecture overview of the education process using deep reinforcement learning

Learning Algorithms:
Within the agent module, various reinforcement learning algorithms are employed to facilitate
learning and decision-making. These algorithms encompass a spectrum of approaches
including value-based methods such as Q-learning and deep Q-networks, policy-based
methods such as policy gradients and actor-critic methods, and model-based methods such as
Monte Carlo tree search and model-based reinforcement learning with neural networks. The
choice of algorithm depends on factors such as the complexity of the task, the availability of
data, and computational resources.
Memory and Experience Replay:
To improve sample efficiency and enhance learning stability, a memory or experience replay
mechanism is often incorporated into the system architecture. This mechanism stores past
experiences in a memory buffer and samples mini-batches of experiences during training. By
decoupling the temporal sequence of experiences, experience replay facilitates more efficient
learning, mitigates the effects of correlated data, and enables the agent to learn from a diverse
set of experiences.
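
A minimal version of such a replay mechanism is sketched below; the buffer capacity and batch size are illustrative defaults rather than values prescribed here.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer that stores transitions and samples uncorrelated mini-batches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)      # random sampling breaks temporal correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```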

Communication and Coordination:


In distributed reinforcement learning settings or multi-agent environments, communication and
coordination between agents are essential for achieving collaborative or competitive goals. The
system architecture should include mechanisms for inter-agent communication, coordination
of actions, and sharing of learned knowledge. This may involve message passing protocols,
centralized or decentralized coordination mechanisms, and protocols for resolving conflicts or
synchronizing behavior.

Scalability and Parallelism:


As reinforcement learning tasks become increasingly complex and data-intensive, scalability
and parallelism become critical considerations in system design. The architecture should be
designed to leverage parallel computing resources such as multi-core CPUs, GPUs, or
distributed computing clusters. Techniques such as asynchronous updates, parameter server
architectures, and data parallelism can be employed to distribute computation and training
across multiple processing units, thereby accelerating learning and improving scalability.

Integration with Application:


Finally, the proposed system architecture should seamlessly integrate with the target
application domain, providing interfaces for data input/output, control signals, and feedback
mechanisms. Whether the application involves controlling robotic systems, optimizing
financial trading strategies, or managing healthcare interventions, the system architecture
should be adaptable to accommodate domain-specific requirements and constraints.

Figure 2: Interaction between the agent and the environment.


4. Methods

This section examines the principal families of reinforcement learning methods in detail, along with mathematical
expressions where applicable.

4.1. Fundamentals of Reinforcement Learning:

Markov Decision Processes (MDPs):


A Markov Decision Process (MDP) is a mathematical framework used to model decision-
making in situations where outcomes are partly random and partly under the control of a
decision-maker. It is characterized by a tuple \((S, A, P, R, \gamma)\):
• \(S\) is the set of states.
• \(A\) is the set of actions.
• \(P\) is the transition probability function \(P(s_{t+1} \mid s_t, a_t)\), which defines the probability of transitioning to state \(s_{t+1}\) given that action \(a_t\) is taken in state \(s_t\).
• \(R\) is the reward function \(R(s_t, a_t, s_{t+1})\), which defines the immediate reward received after transitioning from state \(s_t\) to state \(s_{t+1}\) by taking action \(a_t\).
• \(\gamma\) is the discount factor, which determines the importance of future rewards compared to immediate rewards. A minimal tabular encoding of such a tuple is sketched after this list.
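
The sketch below encodes a small finite MDP with explicit transition and reward arrays; the two-state example and its reward structure are hypothetical and only meant to make the tuple concrete.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TabularMDP:
    """Finite MDP (S, A, P, R, gamma) stored as explicit arrays."""
    P: np.ndarray        # P[s, a, s'] = probability of reaching s' after taking a in s
    R: np.ndarray        # R[s, a, s'] = immediate reward for that transition
    gamma: float         # discount factor in [0, 1)

# Hypothetical two-state, two-action example: action 1 moves toward state 1,
# and reaching state 1 from state 0 yields a reward of 1.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0         # action 0 always leads to state 0
P[:, 1, 1] = 1.0         # action 1 always leads to state 1
R = np.zeros((2, 2, 2))
R[0, 1, 1] = 1.0
mdp = TabularMDP(P=P, R=R, gamma=0.9)
```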

Policy, Value, and Model-Based Methods:


• Policy (\(\pi\)): A policy is a mapping from states to actions, indicating the agent's behavior or strategy. It can be deterministic or stochastic.
• Value Functions (a policy-evaluation sketch is given after this list):
  • State-Value Function (V): \(V(s)\) represents the expected cumulative reward starting from state \(s\) and following a particular policy \(\pi\).
  • Action-Value Function (Q): \(Q(s, a)\) represents the expected cumulative reward starting from state \(s\), taking action \(a\), and following a particular policy \(\pi\).
• Model-Based Methods: These methods involve learning a model of the environment dynamics, including transition probabilities and reward functions. Model-based methods utilize these learned models to make decisions and plan actions.
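
To make the value-function definitions concrete, the sketch below performs iterative policy evaluation, repeatedly applying the Bellman expectation backup to a small hand-built MDP; the example dynamics and the stopping tolerance are illustrative assumptions.

```python
import numpy as np

def policy_evaluation(P, R, gamma, policy, tol=1e-8):
    """Compute V^pi for a stochastic tabular policy pi[s, a] by iterating the Bellman expectation backup."""
    V = np.zeros(P.shape[0])
    while True:
        # q[s, a] = expected immediate reward + discounted value of the successor state
        q = np.einsum("sap,sap->sa", P, R) + gamma * np.einsum("sap,p->sa", P, V)
        V_new = np.einsum("sa,sa->s", policy, q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Two-state, two-action example: action 1 moves toward state 1, which pays a reward of 1.
P = np.zeros((2, 2, 2)); R = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0; P[:, 1, 1] = 1.0
R[0, 1, 1] = 1.0
uniform = np.full((2, 2), 0.5)              # uniform random policy over the two actions
print(policy_evaluation(P, R, gamma=0.9, policy=uniform))
```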

4.2. Value-Based Reinforcement Learning:

Q-Learning:
Q-learning is a model-free reinforcement learning algorithm for learning optimal policies. It
updates the action-value function \(Q(s, a)\) based on observed transitions and rewards,
following the Bellman update:
\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \]
where \(\alpha\) is the learning rate, \(r_{t+1}\) is the reward received after taking action \(a_t\)
in state \(s_t\), and \(\gamma\) is the discount factor.
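
A tabular sketch of this update, using an epsilon-greedy behavior policy against any environment exposing reset and step (such as the GridWorld sketch earlier), is shown below; the hyperparameter defaults are illustrative.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, num_actions=4):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    `env` is any object with reset() -> state and step(a) -> (state, reward, done).
    """
    Q = defaultdict(lambda: np.zeros(num_actions))
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```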

Deep Q-Networks (DQN):


DQN is an extension of Q-learning that utilizes deep neural networks to approximate the
action-value function. It introduces experience replay and target networks to stabilize training
and improve sample efficiency.
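
A condensed PyTorch-style training step in the spirit of DQN is sketched below; the network size, optimizer usage, and loss function are illustrative assumptions rather than the settings of any particular published implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP mapping a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the temporal-difference error, bootstrapping from a frozen target network."""
    # batch: float tensors for states/next_states/rewards/dones, int64 tensor for actions
    states, actions, rewards, next_states, dones = batch
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # the target network is held fixed here
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q  # no bootstrap on terminal transitions
    loss = nn.functional.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodically: target_net.load_state_dict(q_net.state_dict())
```

A replay buffer such as the one sketched in Section 3 would supply `batch`, and the commented target-network copy corresponds to the periodic synchronization that stabilizes training.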

Double Q-Learning:
Double Q-learning is an enhancement to traditional Q-learning that mitigates overestimation
bias by decoupling action selection and value estimation. It maintains two sets of Q-value
estimates and uses one set to select actions and the other to evaluate their values.
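
In tabular form, the decoupling can be sketched as follows: one Q-table is chosen at random for the update, the greedy action is selected with that table, and the other table supplies the value estimate. The function below is a minimal sketch with illustrative hyperparameters.

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One double Q-learning update on two Q-tables indexed as Q[state, action]."""
    if np.random.rand() < 0.5:
        A, B = Q1, Q2          # update Q1, evaluate with Q2
    else:
        A, B = Q2, Q1          # update Q2, evaluate with Q1
    best_next = int(np.argmax(A[s_next]))                            # action chosen by the table being updated
    target = r + (0.0 if done else gamma * B[s_next, best_next])     # value taken from the other table
    A[s, a] += alpha * (target - A[s, a])
```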

This concludes the brief overview of value-based methods; policy-based, model-based, deep reinforcement
learning, and multi-agent reinforcement learning methods are covered in the subsections that follow.

4.3. Policy-Based Reinforcement Learning:

Policy Gradients:
Policy gradient methods directly parameterize the policy and update its parameters to maximize
expected cumulative rewards. The objective function for policy gradients is typically the
expected return \(J(\theta)\), where \(\theta\) represents the policy parameters. The gradient of
the objective function is estimated using the following formula:
\[ \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t}) \, G_{i,t} \]
where \(\pi_{\theta}(a \mid s)\) is the policy, \(G_{i,t}\) is the discounted return from time step \(t\) of episode \(i\), and \(N\) is the
number of episodes.
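
A minimal NumPy sketch of this estimator for a linear softmax policy over discrete actions is given below; the feature representation, trajectory format, and learning rate are illustrative assumptions.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi(a | s) for a linear softmax policy with preferences theta[a] . phi(s)."""
    logits = theta @ phi_s
    logits -= logits.max()                  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def reinforce_update(theta, episodes, gamma=0.99, lr=0.01):
    """REINFORCE: ascend the Monte Carlo estimate of grad J(theta).

    `episodes` is a list of trajectories, each a list of (phi_s, action, reward) tuples.
    """
    grad = np.zeros_like(theta)
    for trajectory in episodes:
        # compute the discounted return G_t for every time step of the trajectory
        G, returns = 0.0, []
        for _, _, r in reversed(trajectory):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (phi_s, a, _), G_t in zip(trajectory, returns):
            pi = softmax_policy(theta, phi_s)
            # grad log pi(a|s) for a linear softmax policy: (one_hot(a) - pi) outer phi(s)
            one_hot = np.zeros(len(pi)); one_hot[a] = 1.0
            grad += np.outer(one_hot - pi, phi_s) * G_t
    theta += lr * grad / len(episodes)
    return theta
```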

Actor-Critic Methods:
Actor-Critic methods combine value-based and policy-based approaches by maintaining both
a policy (actor) and a value function (critic). The critic evaluates the actions chosen by the
actor, providing feedback signals that guide policy updates. Actor-Critic methods reduce the
variance of policy gradients and improve sample efficiency.
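
A tabular one-step actor-critic update along these lines might look as follows; the separate actor and critic learning rates are illustrative choices.

```python
import numpy as np

def actor_critic_step(H, V, s, a, r, s_next, done, gamma=0.99, lr_actor=0.05, lr_critic=0.1):
    """One-step actor-critic update.

    H[s, a] are actor preferences (policy = softmax over H[s]), V[s] is the critic's state value.
    """
    # the critic's TD error acts as a low-variance feedback signal for the actor
    td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
    V[s] += lr_critic * td_error

    # gradient of log pi(a|s) for a softmax-over-preferences actor: one_hot(a) - pi(.|s)
    prefs = H[s] - H[s].max()
    pi = np.exp(prefs) / np.exp(prefs).sum()
    one_hot = np.zeros_like(pi); one_hot[a] = 1.0
    H[s] += lr_actor * td_error * (one_hot - pi)
```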

Proximal Policy Optimization (PPO):


PPO is a policy optimization algorithm that aims to balance exploration and exploitation by
enforcing a constraint on the policy update. It updates the policy parameters by maximizing a
clipped surrogate objective function, ensuring that policy updates are conservative and stable.
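
The clipped surrogate objective itself is compact enough to sketch directly; in the illustrative function below the clipping range is a commonly used default, and the advantage estimates are assumed to be computed separately.

```python
import numpy as np

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (to be maximized).

    All arguments are 1-D arrays over the transitions in a batch; the advantages are
    assumed to be estimated separately (e.g. with a learned value function).
    """
    ratio = np.exp(log_probs_new - log_probs_old)              # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # taking the minimum removes the incentive to move the policy far from pi_old
    return np.mean(np.minimum(unclipped, clipped))
```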

4.4. Model-Based Reinforcement Learning:

Monte Carlo Tree Search (MCTS):


MCTS is a model-based reinforcement learning algorithm commonly used in game playing
and planning tasks. It builds a search tree by repeatedly simulating trajectories from the current
state and selecting actions based on exploration-exploitation trade-offs.
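
The exploration-exploitation trade-off in the selection phase is often handled with a UCT-style score; a minimal sketch of such a scoring rule is shown below, with the exploration constant as an illustrative choice.

```python
import math

def uct_score(child_value_sum, child_visits, parent_visits, c=1.4):
    """UCT selection score: exploit the average value, explore rarely visited children."""
    if child_visits == 0:
        return float("inf")                      # always try unvisited children first
    exploitation = child_value_sum / child_visits
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploitation + exploration

def select_child(children):
    """Pick the child with the highest UCT score; `children` is a list of (value_sum, visits) pairs."""
    parent_visits = sum(visits for _, visits in children)
    scores = [uct_score(v, n, parent_visits) for v, n in children]
    return scores.index(max(scores))
```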

Dyna-Q:
Dyna-Q is a model-based reinforcement learning algorithm that combines Q-learning with a
learned model of the environment. It uses the learned model to simulate experiences and update
the action-value function, improving sample efficiency and generalization.
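
A minimal Dyna-Q step, assuming a deterministic environment so that the model can simply memorize observed transitions, is sketched below; the number of planning steps is an illustrative choice.

```python
import random
import numpy as np

def dyna_q_step(Q, model, s, a, r, s_next, done, alpha=0.1, gamma=0.99, planning_steps=10):
    """One real Q-learning update followed by several simulated (planning) updates.

    `Q` maps states to arrays of action values; `model` memorizes (s, a) -> (r, s_next, done),
    which is only valid for deterministic environments.
    """
    def q_update(s, a, r, s_next, done):
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])

    # learn from the real transition and record it in the model
    q_update(s, a, r, s_next, done)
    model[(s, a)] = (r, s_next, done)

    # planning: replay randomly chosen remembered transitions through the same update
    for _ in range(planning_steps):
        (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps_next, pdone)
```

Here `Q` would typically be a dictionary of per-state action-value arrays, as in the tabular Q-learning sketch above, and `model` an ordinary dict.
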
Model-Based RL with Neural Networks:
Model-based RL methods leverage neural networks to learn models of the environment
dynamics. These learned models are then used for planning and decision-making, enabling
agents to learn efficient policies even in complex environments.

4.5. Deep Reinforcement Learning:

Deep Q-Networks (DQN):


DQN extends Q-learning by using deep neural networks to approximate the action-value
function. It employs experience replay and target networks to stabilize training and improve
sample efficiency.

Deep Deterministic Policy Gradients (DDPG):


DDPG is a model-free, off-policy algorithm for continuous action spaces. It combines the
actor-critic architecture with deep neural networks and uses a deterministic policy to learn a
mapping from states to actions.

Asynchronous Advantage Actor-Critic (A3C):


A3C is a distributed reinforcement learning algorithm that employs multiple actor-learner
agents to explore the environment in parallel. It uses an advantage function to estimate the
quality of actions and updates the policy and value function asynchronously.

Trust Region Policy Optimization (TRPO):


TRPO is a policy optimization algorithm that aims to improve stability and sample efficiency
by constraining the size of policy updates. It computes policy updates by maximizing the
expected return subject to a KL-divergence constraint.

Rainbow: Combining Improvements in Deep RL:


Rainbow is a combination of several deep reinforcement learning improvements, including
DQN variants, prioritized experience replay, double Q-learning, dueling networks, and
distributional RL. It integrates these enhancements to achieve state-of-the-art performance in
various tasks.

These methods represent a diverse set of approaches in reinforcement learning, each with its
strengths and limitations. Depending on the specific task and domain, different methods may
be more suitable for achieving optimal performance.

4.6. Multi-Agent Reinforcement Learning:

Independent Q-Learning:
Independent Q-learning is a straightforward extension of single-agent Q-learning to multi-
agent settings. Each agent learns its own Q-values independently, assuming that other agents'
policies remain fixed. While simple, this approach may suffer from non-stationarity and
convergence issues in competitive or cooperative settings.

Deep Multi-Agent Reinforcement Learning:


Deep multi-agent reinforcement learning extends deep reinforcement learning techniques to
multi-agent environments. It employs deep neural networks to approximate policies or value
functions for each agent and may involve centralized training with decentralized execution.
Techniques such as centralized critic and decentralized actor, or communication protocols
among agents, are commonly used to improve coordination and performance.

4.7. Applications of Reinforcement Learning:

Robotics:
Reinforcement learning has found widespread applications in robotics, including robot
manipulation, locomotion, and control. RL algorithms enable robots to adapt and learn from
interaction with the environment, facilitating tasks such as grasping objects, navigating
obstacles, and performing complex manipulation tasks.

Gaming and Game AI:


In the domain of gaming, reinforcement learning has been extensively used to develop
intelligent agents capable of playing various games. From classic board games like chess and
Go to modern video games, RL algorithms have demonstrated remarkable performance and
competitiveness against human players.

Finance and Trading:


Reinforcement learning techniques are increasingly being applied in financial markets for
portfolio management, algorithmic trading, and risk optimization. RL algorithms can learn to
exploit market inefficiencies, adapt to changing market conditions, and optimize trading
strategies to maximize returns while minimizing risks.

Healthcare:
In healthcare, reinforcement learning is being utilized for personalized treatment planning,
disease diagnosis, and drug discovery. RL algorithms can learn from patient data to recommend
personalized treatment protocols, optimize resource allocation in healthcare systems, and
identify novel drug candidates.

Autonomous Vehicles:
Reinforcement learning plays a crucial role in the development of autonomous vehicles,
enabling them to perceive the environment, make decisions, and navigate safely in complex
traffic scenarios. RL algorithms can learn to control vehicle dynamics, plan trajectories, and
adapt to diverse driving conditions.

Natural Language Processing:


In natural language processing, reinforcement learning is applied to tasks such as dialogue
generation, machine translation, and information retrieval. RL algorithms can learn to generate
coherent and contextually relevant responses in conversational agents, improve the quality of
machine translation systems, and optimize search engine rankings based on user feedback.

These applications represent just a subset of the diverse domains where reinforcement learning
has made significant contributions. As the field continues to advance, we can expect to see
further integration of RL techniques into various real-world applications, driving innovation
and progress across multiple industries.
5. Simulation Results

Simulation results play a pivotal role in the validation and evaluation of reinforcement learning
algorithms and applications. Through extensive simulations, researchers can assess the
performance, robustness, and scalability of their proposed methods across diverse
environments and scenarios. These results provide valuable insights into the behavior of the
learning agents, their ability to generalize to unseen situations, and their sensitivity to various
hyperparameters and settings. Moreover, simulation results allow researchers to compare
different algorithms, identify strengths and weaknesses, and iteratively refine their approaches
towards achieving better performance.

Furthermore, simulation results serve as a bridge between theoretical developments and real-
world deployment. By simulating environments that closely resemble real-world scenarios,
researchers can test the efficacy of reinforcement learning algorithms in practical applications
without incurring the costs or risks associated with real-world experimentation. Additionally,
simulation results enable researchers to conduct extensive ablation studies, sensitivity analyses,
and parameter tuning experiments to gain a deeper understanding of the underlying
mechanisms driving learning behavior. Overall, simulation results constitute a crucial step in
the development and validation of reinforcement learning algorithms, paving the way for their
successful application in real-world domains.

6. Results
Performance Evaluation of Value-Based Methods

The evaluation of value-based methods, including Q-Learning and its extensions, provides
critical insights into their efficacy in different environments. Q-Learning, which updates the
action-value function \(Q(s, a)\) based on observed transitions, was tested across multiple grid-world
scenarios and discrete action spaces. The experiments showed that Q-Learning
converges to an optimal policy in relatively simple environments but suffers from instability
and slow convergence in high-dimensional spaces.

The introduction of Deep Q-Networks (DQN) improved performance significantly. By
utilizing deep neural networks to approximate \(Q(s, a)\), DQN effectively handled larger state
spaces and more complex environments, such as those in the Atari 2600 benchmark suite.
However, the results also indicated that DQN is sensitive to hyperparameters and requires
extensive tuning. Furthermore, DQN's performance was enhanced by incorporating techniques
like experience replay and target networks, which helped stabilize learning and reduce
variance.

Policy-Based Methods: Stability and Efficiency

Policy-based methods, such as the Policy Gradient (PG) algorithm, were evaluated for their
ability to handle continuous action spaces. The experiments demonstrated that PG methods are
particularly effective in tasks requiring smooth and continuous control, such as robotic arm
manipulation. The mathematical formulation, where the policy \(\pi_{\theta}(a \mid s)\) is directly
optimized, provided flexibility but also introduced challenges related to high variance in
gradient estimates.
Actor-Critic methods, which combine value-based and policy-based approaches, showed
significant improvements. By using a critic to evaluate actions chosen by the actor, the variance
of gradient estimates was reduced, leading to more stable and efficient learning. This was
particularly evident in continuous control tasks, where methods like Deep Deterministic Policy
Gradient (DDPG) outperformed both traditional policy gradient methods and value-based
approaches.

Model-Based Methods: Sample Efficiency

Model-based reinforcement learning methods demonstrated superior sample efficiency
compared to model-free methods. By learning an explicit model of the environment dynamics
\(P(s' \mid s, a)\) and reward function \(R(s, a, s')\), methods such as Dyna-Q were able to simulate
experiences and update policies more frequently. Experimental results from the CartPole and
MountainCar environments indicated that Dyna-Q required significantly fewer real
environment interactions to converge to an optimal policy.

Advanced model-based approaches, using neural networks to approximate environment
models, further improved performance. These methods, such as those incorporating Monte
Carlo Tree Search (MCTS), were particularly effective in complex planning tasks like chess
and Go. The integration of deep learning with model-based RL allowed for handling high-dimensional
state spaces while maintaining sample efficiency.

Deep Reinforcement Learning: Scalability and Robustness

Deep Reinforcement Learning (DRL) methods, particularly those leveraging deep neural
networks, showcased remarkable scalability and robustness. In the domain of Atari games,
DRL methods such as Asynchronous Advantage Actor-Critic (A3C) and Proximal Policy
Optimization (PPO) achieved state-of-the-art performance. A3C, by running multiple parallel
agents, improved training efficiency and policy robustness. PPO, on the other hand, introduced
a surrogate objective function to constrain policy updates, ensuring stable and reliable learning
across various environments.

The results also highlighted the significance of hybrid approaches. Rainbow, a combination of
several improvements on DQN, demonstrated the cumulative benefits of integrating multiple
enhancements, such as prioritized experience replay, double Q-learning, dueling network
architectures, and distributional RL. Rainbow consistently outperformed individual
components in benchmark tests, illustrating the advantages of hybrid deep RL strategies.

Multi-Agent Reinforcement Learning: Coordination and Competition

In multi-agent settings, Independent Q-Learning served as a baseline, revealing its limitations
in handling non-stationarity and coordination. Multi-agent reinforcement learning (MARL)
methods, such as those involving centralized training with decentralized execution, showed
superior performance. Experiments in cooperative tasks, such as multi-agent navigation and
resource allocation, demonstrated that MARL methods could effectively learn coordinated
behaviors. Centralized critic approaches, where a centralized value function is used to critique
decentralized policies, provided significant improvements in training stability and
performance. These methods facilitated better coordination among agents, as evidenced by
their success in both cooperative and competitive tasks, such as the StarCraft II multi-agent
challenge.
7. Conclusion

Reinforcement learning represents a powerful framework for training intelligent agents to make
sequential decisions in dynamic environments. Through a combination of trial and error,
feedback, and exploration, RL algorithms have achieved remarkable successes across various
domains. This paper has provided a comprehensive overview of reinforcement learning,
covering its fundamental principles, key algorithms, theoretical foundations, practical
implementations, and applications.

The comprehensive analysis of reinforcement learning (RL) methods highlights the significant
advancements and diverse applications of RL across various domains. Value-based methods
like Q-Learning and its deep learning extension, DQN, demonstrated robust performance in
discrete and simple environments but struggled with scalability and stability in more complex
scenarios. Policy-based methods, particularly those utilizing policy gradients and actor-critic
approaches, showed effectiveness in continuous control tasks, despite facing challenges related
to high variance in gradient estimates. Model-based methods, exemplified by Dyna-Q,
exhibited superior sample efficiency by leveraging explicit models of environment dynamics.
Deep Reinforcement Learning (DRL) methods, such as A3C and PPO, showcased remarkable
scalability and robustness, achieving state-of-the-art performance in complex environments
and tasks. Hybrid approaches like Rainbow highlighted the cumulative benefits of integrating
multiple improvements, underscoring the importance of combining different RL strategies to
enhance overall performance.

Real-world applications of RL have illustrated its transformative potential across various fields,
including robotics, gaming, finance, healthcare, autonomous vehicles, and natural language
processing. In robotics, RL has enabled the development of systems capable of performing
complex manipulation tasks with precision and adaptability. In gaming, RL has achieved
superhuman performance, particularly in strategic decision-making problems. The application
of RL in finance and healthcare has demonstrated its ability to optimize decision-making
processes and improve outcomes. However, despite these successes, challenges such as sample
inefficiency, exploration-exploitation trade-offs, and generalization remain critical areas for
future research. Addressing these challenges will be essential for advancing the field of RL and
fully realizing its potential in increasingly complex and dynamic environments.
