JOURNAL OF SPACECRAFT AND ROCKETS
Vol. 58, No. 2, March–April 2021

Deep Reinforcement Learning for Spacecraft Proximity Operations Guidance

Kirk Hovell* and Steve Ulrich†
Carleton University, Ottawa, Ontario K1S 5B6, Canada
https://doi.org/10.2514/1.A34838

Presented as Paper 2020-1600 at the AIAA SciTech 2020 Forum, Orlando, FL, January 6–10, 2020; received 21 April 2020; accepted for publication 22 October 2020; published online 14 January 2021.
*Ph.D. Candidate, Department of Mechanical and Aerospace Engineering, 1125 Colonel By Drive. Student Member AIAA.
†Associate Professor, Department of Mechanical and Aerospace Engineering, 1125 Colonel By Drive. Senior Member AIAA.
This paper introduces a guidance strategy for spacecraft proximity operations, which leverages deep reinforcement
learning, a branch of artificial intelligence. This technique enables guidance strategies to be learned rather than
designed. The learned guidance strategy feeds velocity commands to a conventional controller to track. Control
theory is used alongside deep reinforcement learning to lower the learning burden and facilitate the transfer of the
learned behavior from simulation to reality. In this paper, a proof-of-concept spacecraft pose tracking and docking
scenario is considered, in simulation and experiment, to test the feasibility of the proposed approach. Results show
that such a system can be trained entirely in simulation and transferred to reality with comparable performance.

I. Introduction

Autonomous spacecraft rendezvous and docking operations have become an active research area in recent decades. Applications, such as on-orbit servicing, assembly, and debris capture [1], require the capability for a chaser spacecraft to autonomously and safely maneuver itself in proximity to a potentially uncooperative target object. A common strategy is pose tracking (i.e., synchronizing the translational and rotational motion of the chaser with respect to the target, such that there is no relative motion between the two objects). Only then does the chaser perform its final approach and capture or dock with the target. Guidance and control algorithms have been developed for this purpose. For example, a guidance and control scheme for capturing a tumbling debris with a robotic manipulator was developed by Aghili [2]. Wilde et al. [3] developed inverse dynamics models to generate guidance paths, and included experimental validation. Ma et al. [4] applied feedforward optimal control for orienting a chaser spacecraft at a constant relative position with respect to a tumbling target. A variety of dual quaternion approaches have been explored [5–7]. Pothen and Ulrich [8,9] used the Udwadia–Kalaba equation to formulate the close-range rendezvous problem, and included experimental validation. Lyapunov vector fields have been used to command docking with an uncooperative target spacecraft by Hough and Ulrich [10].

The guidance and control techniques presented previously are handcrafted to solve a particular task and require significant engineering effort. As more complex tasks are introduced, the engineering effort needed to handcraft solutions may become infeasible. For example, developing a guidance law for a chaser spacecraft to detumble a piece of spinning space debris, when the two objects are connected via flexible tethers, does not have a clear solution that can be handcrafted. Motivated by more difficult guidance tasks, this paper introduces a new approach that builds upon a branch of artificial intelligence, called deep reinforcement learning, to augment the guidance capabilities of spacecraft for difficult tasks.

Reinforcement learning is based on the idea of an agent trying to choose actions to maximize the rewards it receives over a period of time. The agent uses a policy that, when given an input, returns a suggested action to take. At each time step, a scalar reward, which may be positive or negative, is given to the agent and corresponds to task completion. Through trial and error, the agent attempts to learn a policy that maps inputs to actions, such that the actions taken maximize the rewards received. By selecting an appropriate reward scheme, complex behaviors can be learned by the agent without being explicitly programmed. For a tethered space debris detumbling task, for example, rewards could be given for reducing the angular velocity of the debris, and penalties could be given for fuel usage, forcing the agent to learn a fuel-efficient detumbling policy. The engineering effort is reduced to specifying the reward system rather than the complete logic required to complete the task. This is the main appeal of reinforcement learning. Neural networks have become a popular choice for representing the policy in reinforcement learning, as they are universal function approximators [11]. When neural networks are used within reinforcement learning, the technique is called deep reinforcement learning. The core concepts in reinforcement learning have been around for decades, but have only become useful in recent years due to the rapid rise in computing power. Many notable papers have been published recently that use deep reinforcement learning to solve previously unsolvable tasks. In 2015, Mnih et al. [12] applied deep reinforcement learning to play many Atari 2600 [13] games at a superhuman level by training a policy to select button presses (the action) as a function of the screen pixels (the input). Silver et al. [14,15] used deep reinforcement learning to master the game of Go in 2016, a decade earlier than expected.

Training deep reinforcement learning policies on physical robots is time consuming and expensive, and leads to significant wear and tear on the robot because, even with state-of-the-art learning algorithms, the task may have to be repeated hundreds or thousands of times before learning succeeds. Training a policy onboard a spacecraft is not viable due to fuel, time, and computer limitations. An alternative is to train the policy in a simulated environment and transfer the trained policy to a physical robot. If the simulated model is sufficiently accurate and the task is not highly dynamic, policies trained in simulation may be directly transferrable to a real robot [16]. For example, Tai et al. [17] trained a room-navigating robot in simulation and deployed it to reality with success. Although good results were obtained, this technique is unlikely to generalize well to more difficult or dynamic tasks due to the effect known as the simulation-to-reality gap [16,18,19]. This effect states that, because the simulator within which the policy is trained cannot perfectly model the dynamics of the real world, such a policy will fail to perform well in a real-world environment due to overfitting of the simulated dynamics. Efforts to get around this problem often take advantage of domain randomization [20] (i.e., randomizing the environmental parameters for each simulation to force the policy to become robust to environmental changes). Domain randomization has been used successfully in a drone racing task [21] and in the manipulation of objects with a robotic hand [20]. However, domain randomization significantly increases the training time because the policy must learn to become adept at all dynamic variations of the task.
Other efforts to solve the simulation-to-reality problem involve continuing to train the policy once deployed to experiment. A drifting car [22] policy was partially trained in simulation, and then fine-tuning training was performed once experimental data were collected. A quadrotor stabilization task [23] summed the policy output with a conventional proportional derivative controller to guide the learning process and help with the simulation-to-reality transfer.

Deep reinforcement learning has also been applied to aerospace applications, although mostly in simulation. A simulated fleet of wildfire surveillance aircraft used deep reinforcement learning to command the flight path of the aircraft [24]. Deep reinforcement learning has also been applied to spacecraft map generation while orbiting small bodies [25], spacecraft orbit control in unknown gravitational fields [26], and spacecraft orbital transfers [27]. Others have used reinforcement learning to train a policy that performs guidance and control for pinpoint planetary landing [28,29]. Neural networks have been trained to approximate off-line-generated optimal guidance paths for pinpoint planetary landing, such that the neural network approximates an optimal guidance algorithm that can be executed in real time [30,31].

Inspired by the ability of deep reinforcement learning to have a behavior be learned rather than handcrafted, and motivated by the need for new simulation-to-reality transfer techniques, this paper introduces a novel technique that allows for the use of reinforcement learning on a real spacecraft platform. The proposed technique builds off of the planetary landing work [28,29], where the neural networks were trained to approximate handcrafted optimal guidance trajectories and a conventional controller was used to track the approximated trajectory. Here, deep reinforcement learning is used to train a guidance policy whose trajectories are fed to a conventional controller to track. Using reinforcement learning allows an unbiased guidance policy to be discovered by the agent instead of being shown many handcrafted guidance trajectories to mimic. The proposed technique is in contrast to typical reinforcement learning research, where the policy is responsible for learning both the guidance and control logic. By restricting the policy to learn only the guidance portion, we harness the high-level, unbiased, task-solving abilities of reinforcement learning while deferring the control aspect to the well-established control theory community. Control theories have been developed that are able to perform trajectory tracking well under dynamic uncertainty [32–34], and can therefore handle model discrepancies between simulation and reality. Harris et al. [35] recently presented a strategy that uses reinforcement learning to switch between a set of available controllers depending on the system state. The authors [35] suggest that reinforcement learning should not be tasked with learning guidance and control because control theory already has great success. By combining deep reinforcement learning for guidance with conventional control theory, the policy is prevented from learning a controller that overfits the error-prone simulated dynamics. We call our deep reinforcement learning guidance strategy deep guidance. The novel contributions of this work are 1) the deep guidance technique, which combines deep reinforcement learning as guidance with a conventional controller; 2) experimental demonstrations showing that this deep guidance strategy can be trained in simulation and deployed to reality without any fine-tuning; and 3) the first, to the best of the authors' knowledge, experimental demonstration of artificial intelligence commanding the motion of a spacecraft platform.

It should be noted that, although the authors were motivated by difficult guidance tasks, such as detumbling tethered space debris, in this paper a proof-of-concept task, a simple spacecraft pose tracking and docking scenario, is considered. Demonstrating the deep guidance technique on a simple task will highlight its potential for use on more difficult tasks.

This paper is organized as follows: Sec. II presents background on deep reinforcement learning and the specific learning algorithm used in this paper, Sec. III presents the novel guidance concept developed by the authors, Sec. IV describes the pose tracking scenario considered, Sec. V presents numerical simulations demonstrating the effectiveness of the technique, Sec. VI presents the experimental results, and Sec. VII concludes this paper.

II. Deep Reinforcement Learning

The goal of deep reinforcement learning is to discover a policy, π_θ, represented by a feedforward neural network whose subscript denotes its trainable weights θ, that maps states, x ∈ X, to actions, a ∈ A, that, when executed, maximize the expected rewards received over one episode. (In reinforcement learning, one simulation is called one episode.) The action is obtained by feeding the state to the policy, as follows:

a = π_θ(x)   (1)

Although many deep reinforcement learning algorithms are currently available, this work uses the distributed distributional deep deterministic policy gradient (D4PG) algorithm [36]. The D4PG algorithm, released in early 2018, was selected because it operates in continuous state and action spaces, it has a deterministic output, it can be trained in a distributed manner to take advantage of multi-CPU machines, and it achieves state-of-the-art performance.

The D4PG [36] algorithm is an actor–critic algorithm, implying there is a policy neural network, π_θ(x), that maps states to actions and a value neural network, Z_ϕ(x, a), with trainable weights ϕ, that maps a state–action pair to a probability distribution of the predicted discounted rewards for the remainder of the episode. The total discounted reward expected to be received from a given state, x, when taking action a from the policy, π_θ, is given by

J(θ) = E{Z_ϕ(x, π_θ(x))}   (2)

where J(θ) is the expected reward from the given state as a function of the policy weights θ, and E denotes the expectation. The objective of reinforcement learning is then to find a policy π_θ that maximizes J(θ) through systematically adjusting θ.

To maximize Eq. (2), we must first establish the policy and value neural networks, shown in Fig. 1. The policy network, shown in Fig. 1a, accepts the system state x as the input, shown with three elements x1, x2, and x3 in the figure, and outputs a commanded action, a = π_θ(x), which may be multidimensional. Each arrow in the network corresponds to a trainable weight that is parameterized by θ. The value network, shown in Fig. 1b, accepts the state and action vectors as inputs, and outputs a value distribution, Z_ϕ(x, a).
Fig. 1 Policy and value neural networks used in the D4PG algorithm: a) policy neural network; b) value neural network.
In other words, the value network returns a probability distribution of how many discounted rewards are predicted to be received from taking action a in state x and continuing to follow the policy until the end of the episode.

Simulated state, action, next state, and reward data are obtained from running episodes and placing the data in a large replay buffer that stores the most recent R data. Batches of size M of simulated data are randomly sampled from the replay buffer and used to train the policy and value neural networks. From a given batch of state, action, next state, and reward data, the value network can be trained using supervised learning. Gradient descent is used to minimize the cross-entropy loss function given by

L(ϕ) = E[−Y log(Z_ϕ(x, a))]   (3)

where Y is the target value distribution (i.e., the new best prediction of the true value distribution based on the sampled data), as first introduced by Bellemare et al. [37]. It is calculated using

Y = Σ_{n=0}^{N−1} γ^n r_n + γ^N Z_{ϕ′}(x_N, π_{θ′}(x_N))   (4)

where r_n is the reward received at time step n, γ is the discount factor (to weigh current rewards higher than future rewards), and Z_{ϕ′}(x_N, π_{θ′}(x_N)) is the value distribution evaluated N time steps into the future. The N-step return [38] is used, where N data points into the future are included for a more accurate prediction of Y. It should also be noted that the symbols θ′ and ϕ′ indicate that these are not the true weights of the policy and value networks, but rather an exponential moving average of them, calculated by

θ′ = (1 − ϵ)θ′ + ϵθ   (5)

ϕ′ = (1 − ϵ)ϕ′ + ϵϕ   (6)

with ϵ ≪ 1. Having a copy of the policy and value networks with smoothed weights has been empirically shown to have a stabilizing effect on the learning process [12].

Equation (4) recursively calculates an updated prediction for the value distribution for a given state–action pair according to the reward data received. Then, by minimizing the loss function in Eq. (3) through adjusting ϕ, using learning rate β, the value network slowly approaches these updated predictions, which are then smoothed and used in Eq. (4). This recursive process led Sutton and Barto [39] to write, "we learn a guess from a guess."
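The critic update of Eqs. (3–6) amounts to one supervised-learning step plus a slow copy of the weights. The sketch below is a minimal illustration rather than the authors' released code: it assumes TensorFlow 2.x networks built as in Fig. 1, and a helper `project_onto_support` (a hypothetical name) that performs the categorical projection of the shifted distribution r + γ^N Z onto the fixed bins, as defined by Bellemare et al. [37].

```python
import tensorflow as tf

GAMMA = 0.99      # discount factor (Sec. IV.C)
EPSILON = 0.001   # target-network smoothing constant of Eqs. (5) and (6)

def train_value_step(batch, value_net, value_net_target, policy_net_target,
                     project_onto_support, optimizer):
    """One critic update per Eqs. (3) and (4) on a sampled batch."""
    x, a, n_step_reward, x_N, gamma_N = batch   # gamma_N = GAMMA**N for each sample

    # Target distribution Y (Eq. (4)): bootstrap from the smoothed target networks,
    # then project r + gamma^N * z onto the fixed support (assumed helper, per [37]).
    next_probs = value_net_target([x_N, policy_net_target(x_N)])
    Y = project_onto_support(n_step_reward, gamma_N, next_probs)

    # Cross-entropy loss of Eq. (3) and one gradient-descent step on phi.
    with tf.GradientTape() as tape:
        Z = value_net([x, a])   # predicted value distribution Z_phi(x, a)
        loss = -tf.reduce_mean(tf.reduce_sum(Y * tf.math.log(Z + 1e-8), axis=-1))
    grads = tape.gradient(loss, value_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, value_net.trainable_variables))
    return loss

def update_target_weights(net, target_net, eps=EPSILON):
    """Exponential moving average of the weights, Eqs. (5) and (6)."""
    for w, w_target in zip(net.weights, target_net.weights):
        w_target.assign((1.0 - eps) * w_target + eps * w)
```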
With the training procedure for the value network outlined, the policy network must now be trained. The goal is to adjust the policy parameters θ in the direction of increasing the expected discounted rewards for a given state, J(θ). Because neural networks are differentiable, the chain rule can be used to compute

∂J(θ)/∂θ = (∂J(θ)/∂a)(∂a/∂θ)   (7)

where ∂J(θ)/∂a is computed from the value network, and ∂a/∂θ is computed from the policy network. In a sense, we differentiate through the value network into the policy network. More formally,

∇_θ J(θ) = E[∇_θ π_θ(x) E[∇_a Z_ϕ(x, a)]|_{a=π_θ(x)}]   (8)

describes how the policy parameters θ should be updated to increase the expected rewards when the policy π_θ is used. Finally, the parameters θ are updated via

θ ← θ + ∇_θ J(θ)α   (9)

for a learning rate α.
To implement the algorithm, K agents run independent episodes using the most up-to-date version of the policy. Gaussian exploration noise is applied to the chosen action to force exploration and is the basis for discovering new strategies. The action is obtained through

a_t = π_θ(x_t) + N(0, σ²)   (10)

where N(0, σ²) is the normal distribution with a mean of 0 and an exploration noise standard deviation of σ. The collected data are the state x_t, action a_t, reward r_t, and next state x_{t+N}, and are placed into a replay buffer that stores the most recent R data points. Asynchronously, a learner randomly samples batches of data from the replay buffer and uses them to train the value network one step using Eq. (3), and then trains the policy network one step using Eq. (9). Over time, the accuracy of the value network and the average performance of each agent are expected to increase.
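To make the actor and learner steps concrete, the sketch below illustrates the action selection of Eq. (10) and one policy update per Eqs. (7–9). It is a simplified illustration (not the released implementation), assuming TensorFlow 2.x and a vector `z_atoms` holding the bin locations of the value distribution so that the expected value of Z_ϕ can be taken as a probability-weighted sum.

```python
import numpy as np
import tensorflow as tf

def select_action(policy_net, x_t, sigma, a_low, a_high):
    """Eq. (10): deterministic policy output plus Gaussian exploration noise."""
    a_t = policy_net(x_t[np.newaxis, :]).numpy()[0]
    a_t += np.random.normal(0.0, sigma, size=a_t.shape)
    return np.clip(a_t, a_low, a_high)   # keep the noisy action within its bounds

def train_policy_step(x_batch, policy_net, value_net, z_atoms, optimizer):
    """Eqs. (7)-(9): ascend the expected discounted reward J(theta)."""
    with tf.GradientTape() as tape:
        a = policy_net(x_batch)                        # a = pi_theta(x)
        probs = value_net([x_batch, a])                # Z_phi(x, a), shape (batch, B)
        q = tf.reduce_sum(probs * z_atoms, axis=-1)    # expected value of Z_phi(x, a)
        loss = -tf.reduce_mean(q)                      # minimizing -J(theta) maximizes J(theta)
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
```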


III. Deep Guidance

Using deep reinforcement learning for spacecraft guidance may allow for difficult tasks to be accomplished through learning an appropriate behavior rather than handcrafting such a behavior. In order for the reinforcement learning algorithm presented in Sec. II to be trained in simulation and deployed to reality without any fine-tuning on the spacecraft, it is proposed that the learning algorithm cannot be responsible for the entire guidance, navigation, and control (GNC) stack. This is to prevent the policy from overfitting the simulated dynamics and being unable to handle the transition to a dynamically uncertain real world (i.e., the simulation-to-reality problem) [16]. For this reason, the authors present a system, called deep guidance, which uses deep reinforcement learning as a guidance system along with a conventional controller. Conventional controllers are able to handle dynamic uncertainties and modeling errors that typically plague reinforcement learning policies that attempt to learn the entire GNC routine. It is assumed that perfect navigation is available. A block-scheme diagram of the proposed system is shown in Fig. 2. The learned deep guidance block has the current state x_t as its input and the desired velocity v_t as its output. The desired velocity is fed to a conventional controller, which also receives the current state x_t and calculates a control effort u_t. The control effort is executed on the dynamics that generate a scalar reward r_t and the next state x_{t+1}.

Fig. 2 Proposed deep guidance strategy.

During training of the deep guidance model, an ideal controller is assumed, as shown in Fig. 3. This ensures that the guidance model does not overfit to any specific controller, thereby making this approach controller-independent. Because the ideal controller perfectly commands the dynamics to move at the desired velocity v_t, the ideal controller and the dynamics model may be combined into a single kinematics model.

Fig. 3 Proposed deep guidance with an ideal controller for training purposes in simulation.

Once trained, any controller may be used alongside the deep guidance system for use on a real robot. Because the deep guidance system only experiences an ideal controller during training, it is possible that the guidance model will not experience any nonideal states that may be encountered when using a real controller, which may harm performance. For this reason, Gaussian noise may be applied to the output of the kinematics to force the system into undesirable states during training that may be encountered by a nonideal controller.
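As a concrete illustration of Fig. 3, a training-time environment step reduces to integrating the commanded velocity. The sketch below is a simplification that uses forward-Euler integration and an optional Gaussian perturbation of the resulting state (the noise discussed above); the paper itself integrates the commanded velocity with SciPy's Adams/backward differentiation formula methods (Sec. IV.A).

```python
import numpy as np

DT = 0.2   # time step used in the docking scenarios [s]

def kinematics_step(x_t, policy, kin_noise_std=0.0):
    """Training-time step: ideal controller + dynamics collapse into kinematics.

    The policy commands a velocity v_t = pi_theta(x_t), which is integrated
    directly (forward Euler here, for brevity). Optional Gaussian noise on the
    resulting state mimics the imperfections of a nonideal controller.
    """
    v_t = policy(x_t)                 # desired (x_dot, y_dot, psi_dot)
    x_next = x_t + v_t * DT
    if kin_noise_std > 0.0:
        x_next += np.random.normal(0.0, kin_noise_std, size=x_next.shape)
    return x_next
```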

The proposed system is trained and tested on a spacecraft pose tracking and docking task detailed in the following section.

IV. Problem Statement

This section presents the simulated spacecraft pose tracking and docking environment the deep guidance system will be trained within. Although the deep guidance technique may allow for more complex behaviors to be learned, a simple task is considered here as a proof of concept. Similarly to other spacecraft researchers with experimental validation [3,40–42], a double-integrator planar dynamics model is used such that the available experimental facility can validate the simulated results. Including a full orbital model was deemed to be outside the scope of this paper, because the goal of this work was to demonstrate the simulation-to-reality ability of the proposed deep guidance approach. Furthermore, a double-integrator model is representative of proximity operations in orbit over small distances and timescales [43].

A chaser spacecraft exists in a planar laboratory environment, and it is tasked with approaching and docking with a target spacecraft, as shown in Fig. 4. The chaser and target spacecraft start at rest. The chaser spacecraft is given some time to maneuver to the hold point in front of the target. Then, the chaser spacecraft is tasked with approaching the target such that the two may dock.

Fig. 4 Spacecraft pose tracking and docking task.

A. Kinematics and Dynamics Models

During training, a kinematics model is used, which approximates a dynamics model and an ideal controller, as shown in Fig. 3. The deep guidance policy accepts the chaser state error e_t and outputs the commanded action, which in this case is the velocity v_t:

x_t = [x  y  ψ]^T   (11)

e_t = x_d − x_t   (12)

v_t = π_θ(e_t)   (13)

where x_d is the desired state, and x_t is the state of the system, where x and y represent the X and Y locations of the chaser, respectively, and ψ represents the orientation of the chaser. The velocity v_t is numerically integrated using the SciPy [44] Adams/backward differentiation formula methods in Python to obtain x_{t+1}.

To evaluate the learning performance, the trained policy is periodically evaluated on an environment with full dynamics and a controller, as shown in Fig. 2. In other words, it is "deployed" to another simulation for evaluation in much the same way that it will be deployed to an experiment in Sec. VI. The deep guidance policy outputs the desired velocity as in Eq. (13), which is fed to a simple proportional velocity controller of the form

u_t = K_p(v_t − ẋ_t)   (14)

where K_p = diag{2, 2, 0.1}, and u_t is the control effort. The K_p values were chosen by trial and error until satisfactory performance was achieved.

A double-integrator dynamics model is used to simulate the motion of the chaser:

ẍ = F_x/m   (15)

ÿ = F_y/m   (16)

ψ̈ = τ/I   (17)

where F_x and F_y are the forces applied in the X and Y directions, respectively; τ is the torque applied about the Z axis; m is the chaser spacecraft mass; I is its moment of inertia; ẍ is the acceleration in X; ÿ is its acceleration in Y; and ψ̈ is the angular acceleration about Z. The accelerations are integrated twice to obtain the position and orientation at the following time step.
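The evaluation loop of Fig. 2 can be sketched as follows. This is a simplified, forward-Euler version of Eqs. (11–17) for illustration only; the moment-of-inertia value is a placeholder, since only the 10 kg simulated mass is specified in the paper.

```python
import numpy as np

DT = 0.2                           # time step [s]
KP = np.diag([2.0, 2.0, 0.1])      # proportional gains of Eq. (14)
M, I = 10.0, 0.2                   # mass [kg] and moment of inertia [kg*m^2]; inertia is illustrative

def deployment_step(x, x_dot, x_desired, policy):
    """One guidance + control + dynamics step of Fig. 2, per Eqs. (11)-(17)."""
    e = x_desired - x                       # state error, Eq. (12)
    v = policy(e)                           # deep guidance velocity command, Eq. (13)
    u = KP @ (v - x_dot)                    # proportional velocity controller, Eq. (14)

    # Double-integrator dynamics, Eqs. (15)-(17), with u = [Fx, Fy, tau]
    accel = u / np.array([M, M, I])
    x_dot_next = x_dot + accel * DT         # integrate acceleration to velocity
    x_next = x + x_dot_next * DT            # integrate velocity to pose (x, y, psi)
    return x_next, x_dot_next
```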

The following subsection discusses the reward function used to incentivize the desired behavior.

B. Reward Function

To calculate the reward given to the agent at each time step, a reward field f is generated according to

f(x_t) = −|e_t|   (18)

The reward field is zero at the desired state and becomes negative linearly as the chaser moves away from the desired state.

The reward given to the agent, r_t(x_t, a_t), depends on the action taken in a given state. Therefore, the difference in the reward field between the current and previous time steps is used to calculate the reward given to the agent:

r_t = ‖K(f(x_t) − f(x_{t−1}))‖   (19)

The states are weighted with K = diag{0.5, 0.5, 0.1} to ensure the rotational component does not dominate the reward field. By rewarding the change in reward field, a positive reward will be given to the agent if it chooses an action that moves the chaser closer to the desired state and a negative reward otherwise. Two additional penalties are included to encourage the desired behavior: a velocity-error penalty and a collision penalty. To avoid chaser oscillations, velocity errors are penalized near the desired state. To avoid collisions, the reward is reduced by r_collide = 15 when the chaser collides with the target:

r_t = ‖K(f(x_t) − f(x_{t−1}))‖ − c_1 ‖v_t − v_ref‖/(‖e_t‖ + η) − r_collide   for ‖d_t‖ ≤ 0.3
r_t = ‖K(f(x_t) − f(x_{t−1}))‖ − c_1 ‖v_t − v_ref‖/(‖e_t‖ + η)              otherwise   (20)

Here, v_t is the chaser velocity, v_ref is the velocity of the hold/docking point, η = 0.01 is a small constant, c_1 = 0.5 is used to weigh the velocity penalty such that it does not dominate the reward function, and d_t is the distance between the chaser and the target.
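For reference, a direct transcription of Eqs. (18–20) is given below. It is an illustrative sketch under the stated constants; the reference velocity `v_ref` and the chaser-to-target distance `d_t` are assumed to be supplied by the simulation environment.

```python
import numpy as np

K_WEIGHTS = np.diag([0.5, 0.5, 0.1])    # state weighting of Eq. (19)
C1, ETA, R_COLLIDE = 0.5, 0.01, 15.0    # constants of Eq. (20)

def reward_field(e):
    """Eq. (18): zero at the desired state, linearly more negative away from it."""
    return -np.abs(e)

def reward(e_t, e_prev, v_t, v_ref, d_t):
    """Eq. (20): change in weighted reward field, velocity-error penalty, collision penalty."""
    r = np.linalg.norm(K_WEIGHTS @ (reward_field(e_t) - reward_field(e_prev)))
    r -= C1 * np.linalg.norm(v_t - v_ref) / (np.linalg.norm(e_t) + ETA)
    if np.linalg.norm(d_t) <= 0.3:      # chaser within 0.3 m of the target
        r -= R_COLLIDE
    return r
```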
The learning algorithm details are presented in the following subsection.

C. Learning Algorithm Details

The policy and the value neural networks have 400 neurons in their first hidden layer and 300 neurons in their second layer; Fig. 1 is not to scale. In the value network, the action is fed directly into the second layer of the network, as this was empirically shown to be beneficial [36,45]. The value network therefore has 138,951 trainable parameters ϕ and the policy network has 123,603 trainable parameters θ. Each neuron in both hidden layers uses a rectified linear unit as its nonlinear activation function, shown as follows:

g(y) = { 0  for y < 0;  y  for y ≥ 0 }   (21)

In the output layer of the policy network, a g(y) = tanh(y) nonlinear activation function is used to force the commanded velocity to be bounded. The output layer of the value neural network uses the softmax function, shown as follows, to force the output value distribution to indeed be a valid probability distribution:

g(y_i) = e^{y_i} / Σ_{k=1}^{B} e^{y_k}   ∀ i = 1, …, B   (22)

for each element y_i and for B bins in the value distribution. Drawing from the original value distribution paper [37], B = 51 bins are used in this work. The value distribution bins are evenly spaced on the empirically determined interval [−1000, 100], as this is the range of accumulated rewards encountered during this pose tracking and docking task.

The policy and value networks are trained using the Adam stochastic optimization routine [46] with a learning rate of α = β = 0.0001. The replay buffer R can contain 10^6 transition data points, and a mini batch size M = 256 is used. The smoothed network parameters are updated on each training iteration with ϵ = 0.001. The noise standard deviation applied to the action to force exploration during training is σ = (1/3)(max(a) − min(a))(0.9999^E), where E is the episode number; having a standard deviation of one-third the action range empirically leads to good exploration of the action space. The noise standard deviation is decayed exponentially as more episodes are performed to narrow the action search area, at a rate that halves roughly every 7000 episodes. Ten actors are used, K = 10, such that simulated data using the most up-to-date version of the policy are being collected by 10 actors simultaneously. A discount factor of γ = 0.99 was used along with an N-step return length of N = 1. The TensorFlow machine learning framework (software available from https://www.tensorflow.org/) was used to generate and train the neural networks.

Every five training episodes, the current policy is deployed and run in a full dynamics environment with the proportional velocity controller in Eq. (14) to evaluate its performance, as shown in Fig. 2. During deployment, σ = 0 in Eq. (10) such that no exploration noise is applied to the deep guidance velocity commands.
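The architecture described above maps onto a few lines of Keras (TensorFlow 2.x functional API assumed). The sketch below is illustrative only: the input and output sizes follow Eqs. (11–13) and are not intended to reproduce the exact trainable-parameter counts quoted above, and the exploration-noise schedule is included as a small helper.

```python
import tensorflow as tf
from tensorflow.keras import layers

STATE_DIM, ACTION_DIM, NUM_BINS = 3, 3, 51   # pose error in, velocity command out; 51 bins

def build_policy_network(action_scale):
    """Policy network: 400/300 ReLU hidden layers, tanh output scaled to the velocity bounds."""
    x = layers.Input(shape=(STATE_DIM,))
    h = layers.Dense(400, activation="relu")(x)
    h = layers.Dense(300, activation="relu")(h)
    a = layers.Dense(ACTION_DIM, activation="tanh")(h)
    a = layers.Lambda(lambda t: t * action_scale)(a)   # e.g., [0.05, 0.05, pi/18]
    return tf.keras.Model(inputs=x, outputs=a)

def build_value_network():
    """Value network: the action joins at the second hidden layer; softmax over the bins (Eq. (22))."""
    x = layers.Input(shape=(STATE_DIM,))
    a = layers.Input(shape=(ACTION_DIM,))
    h = layers.Dense(400, activation="relu")(x)
    h = layers.Concatenate()([h, a])                   # action fed directly into layer two
    h = layers.Dense(300, activation="relu")(h)
    probs = layers.Dense(NUM_BINS, activation="softmax")(h)
    return tf.keras.Model(inputs=[x, a], outputs=probs)

# Fixed support of the value distribution, evenly spaced on [-1000, 100]:
z_atoms = tf.linspace(-1000.0, 100.0, NUM_BINS)

def exploration_std(action_range, episode):
    """Sec. IV.C noise schedule: one-third of the action range, halving roughly every 7000 episodes."""
    return (1.0 / 3.0) * action_range * (0.9999 ** episode)
```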
V. Simulation Results

To test the deep guidance approach, three variations of the spacecraft pose tracking and docking task are studied. The first uses a stationary target, the second uses a rotating target, and the third uses a rotating target with a stationary obstacle that must be avoided. The first scenario is run both with constant initial conditions and with randomized initial conditions, whereas the second and third scenarios use randomized initial conditions. The 30 cm cube spacecraft platform has a uniform simulated mass of 10 kg.

A. Docking with a Stationary Target

The chaser and target nominal initial conditions are (3 m, 1 m, 0 rad) and (1.85 m, 0.6 m, π/2 rad) for the chaser and target, respectively. The hold point is offset 1.0 m from the front face of the target, and the docking point is offset 0.5 m. Each episode is run for 90 s with a 0.2 s time step. For the first 45 s, the desired state is the hold point, and afterward, it is the docking point. The commanded velocity bounds are 0.05 m/s and π/18 rad/s.

Results of the first spacecraft pose tracking and docking task are shown in Fig. 5, and are the results of 47 h of training on an Intel i7-8700K CPU. The learning curve, shown in Fig. 5a, plots the total rewards received on each episode, which increase, as expected, during training. The deep guidance strategy successfully learned the desired behavior after roughly 11,000 episodes. The loss function, calculated using Eq. (3) and shown in Fig. 5b, decreases as anticipated, indicating that the value-network output distribution is approaching the target values calculated using Eq. (4), on average.

Fig. 5 Training performance for chaser pose tracking and docking for a stationary target with constant initial conditions: a) learning curve; b) loss.

Sample trajectories during training are shown in Fig. 6. The gray object represents the initial pose of the chaser, the dashed line represents its trajectory, and the solid black object represents its final pose. The pose tracking and docking task was successfully learned.

Fig. 6 Visualization of chaser trajectories at various episodes during the training process.

Next, the chaser and target initial conditions were randomized at the beginning of each episode to force the deep guidance system to generalize across a range of initial states and not simply master a single trajectory. The initial states were randomized around the nominal ones according to a normal distribution with a standard deviation of 0.3 m for position and π/2 rad for attitude. Figure 7 shows the learning curve and associated loss function during training. The learning curve, shown in Fig. 7a, shows that the agent successfully learned a more general guidance strategy in the presence of randomized initial conditions. The learning curve appears noisier than the learning curve shown in Fig. 5a because each episode may have slightly more or fewer rewards available depending on the initial conditions. Sample trajectories once training was complete are shown in Fig. 8.

Fig. 7 Training performance for chaser pose tracking and docking for a stationary target with randomized initial conditions: a) learning curve; b) loss.

Fig. 8 Examples of learned chaser trajectories with randomized initial conditions.

B. Docking with a Spinning Target

This subsection presents the second scenario that the deep guidance system was trained on, that is, a spacecraft pose tracking and docking task in the presence of a spinning target. All learning parameters are identical to those presented in Sec. IV.C, demonstrating the generality of the proposed deep guidance approach. The target spacecraft is given a constant counterclockwise angular velocity of ω = π/45 rad/s. The velocity bounds on the chaser are increased to 0.1 m/s. Each episode is run for 180 s with a 0.2 s time step. For the first 90 s, the chaser is incentivized to track the moving hold position, and afterward, it is rewarded for tracking the moving docking point. The initial conditions have a mean of (3 m, 1 m, 0 rad) and (1.85 m, 1.2 m, 0 rad) for the chaser and target, respectively, and a standard deviation of 0.3 m for position and π/2 rad for attitude. Because the target is rotating, the hold and docking points are inertially moving with time. The learning curve in Fig. 9a shows that a deep guidance policy was successfully learned. Sample trajectories, shown in Fig. 10, show example motion of the chaser. The target performs two complete rotations during the episode so that its initial and final orientations are as shown.

Fig. 9 Training performance for chaser pose tracking and docking with a rotating target: a) learning curve; b) loss.

Fig. 10 Examples of learned chaser trajectories with a rotating target.

C. Docking While Avoiding an Obstacle

This subsection presents a numerical simulation that is identical to Sec. V.B except for the addition of a stationary obstacle that must be avoided. The input to the policy is modified to include the distance from the chaser to the obstacle, as follows:

o_t = [e_t^T  d_t^T]^T   (23)

where d_t are the X and Y distances from the chaser to the obstacle, respectively. For this scenario, the deep guidance velocity is calculated through

v_t = π_θ(o_t)   (24)

The r_collide penalty is also applied when the chaser collides with the obstacle. Collision occurs when the center of mass of the chaser is within 0.3 m of the obstacle, which is located at (1.2, 1.2) m.
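The only change relative to Sec. V.B is the augmented policy input of Eq. (23). A minimal sketch (variable names illustrative) is:

```python
import numpy as np

def obstacle_observation(x_desired, x_chaser, p_obstacle, p_chaser_xy):
    """Eq. (23): stack the pose error with the chaser-to-obstacle distance vector."""
    e_t = x_desired - x_chaser          # pose error, Eq. (12)
    d_t = p_obstacle - p_chaser_xy      # X and Y distances from the chaser to the obstacle
    return np.concatenate([e_t, d_t])   # o_t = [e_t^T, d_t^T]^T

# The deep guidance velocity is then v_t = policy(o_t), per Eq. (24).
```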
The learning curve in Fig. 11a shows that a deep guidance policy was successfully learned. Sample trajectories, shown in Fig. 12, show the motion of the chaser. A policy that causes the chaser to track the target while avoiding collision with the obstacle was successfully learned.

Fig. 11 Training performance for chaser pose tracking and docking while avoiding an obstacle: a) learning curve; b) loss.

Fig. 12 Examples of learned chaser trajectories while avoiding an obstacle.

The deep guidance system presented in this paper was successfully trained on simulated spacecraft pose tracking and docking tasks. It allows the designer to easily specify a reward function to convey the desired behavior to the agent instead of handcrafting a guidance trajectory. Although the guidance trajectories learned in this paper by the deep guidance system would be trivial to handcraft, the purpose of this paper is to introduce the deep guidance technique. It is expected that the deep guidance technique will unlock the ability for more difficult learned guidance strategies in future work. All codes used are open source and can be accessed at http://www.github.com/Kirkados/JSR2020_D4PG.

The trained guidance policies are tested in experiment in the following section.

VI. Experimental Validation

To validate the numerical simulations and to test if the proposed deep guidance strategy can overcome the simulation-to-reality gap, experiments are performed in a laboratory environment at Carleton University. A planar gravity-offset testbed is used, where two spacecraft platforms are positioned on a flat granite surface. Air bearings are used to provide a near-friction-free planar environment. The experimental facility is discussed, followed by the experimental setup and results.

A. Experiment Facility

Experiments were conducted at the Spacecraft Robotics and Control Laboratory of Carleton University, using the Spacecraft Proximity Operations Testbed (SPOT). Specifically, SPOT consists of two air-bearing spacecraft platforms operating in close proximity on a 2.4 × 3.5 m granite surface. The use of air bearings on the platforms reduces the friction to a negligible level. Because of surface slope angles of 0.0026 and 0.0031 deg along both directions, residual gravitational accelerations of 0.439 and 0.525 mm/s² perturb the dynamics of the floating platforms along the X and Y directions, respectively. Both platforms have dimensions of 0.3 × 0.3 × 0.3 m, and are actuated by expelling compressed air at 550 kPa (80 psi) through eight miniature air nozzles distributed around each platform, thereby providing full planar control authority. Each thruster generates approximately 0.25 N of thrust and is controlled at a frequency of 500 Hz by a pulse-width-modulation scheme using solenoid valves.
Pressurized air for the thrusters and the air bearing flotation system is stored onboard in a single air cylinder at 31 MPa (4500 psi). The structure consists of an aluminum frame with four corner rods on which three modular decks are stacked. To protect the internal components, the structure is covered with semitransparent acrylic panels. Figure 13a shows the SPOT laboratory facility, and Fig. 13b shows two SPOT platforms in a proximity operations configuration.

Fig. 13 Spacecraft Proximity Operations Testbed: a) an overview of the SPOT facility; b) SPOT platforms.

The motion of both platforms is measured in real time through four active light-emitting diodes (LEDs) on each platform, which are tracked by an eight-camera PhaseSpace motion capture system. This provides highly accurate ground-truth position and attitude data. All motion capture cameras are connected to a PhaseSpace server, which is connected to a ground station computer. The ground station computer wirelessly communicates ground truth information to the onboard computers of the platforms, which consist of Raspberry Pi 3s running the Raspbian Linux operating system. Based on the position and attitude data the platforms wirelessly receive, they can perform feedback control by calculating the required thrust to maneuver autonomously and actuating the appropriate solenoid valves to realize this motion. The ground station computer also receives real-time telemetry data (i.e., any signals of interest, as specified by the user) from all onboard computers, for post-experiment analysis purposes.

A MATLAB/Simulink numerical simulator that recreates the dynamics and emulates the different onboard sensors and actuators is first used to design and test the upcoming experiment. Once the performance in simulations is satisfactory, the control software is converted into C/C++ using Embedded Coder, compiled, and then executed on the Raspberry Pi-3 computers of the platforms.

An NVIDIA Jetson TX2 Module is used to run the trained deep guidance policy neural network in real time. It accepts the current system state and returns a guidance velocity signal to the Raspberry Pi-3, which executes a control law to track the commanded velocity.

B. Setup

The simulations presented in Sec. V were chosen such that they are replicable experimentally. In other words, all three pose tracking and docking tasks with randomized initial conditions are attempted in experiment. The final parameters θ of the trained deep guidance policies are exported for use on the chaser SPOT platform.
The value network is only used during training and is not exported along with the policy network. Initial conditions are similar to those used previously in simulation.

The platforms remain in contact with the table until a strong lock has been acquired on the LEDs by the motion capture system. Following this, the platforms begin to float and maneuver to the desired initial conditions. Then, the target remains stationary or begins rotating while the chaser platform uses the deep guidance policy trained in simulation to guide itself toward the hold point and finally the docking point on the target.

It should be noted that a significant number of discrepancies exist between the simulated environment the policy was trained within and the experimental facility. The simulated environment did not account for the discrete thrusters and their limitations, the control thrust allocation strategy, signal noise, system delays, friction, air resistance, center of mass offsets, thruster plume interaction, and table slope. In addition, the spacecraft mass used to evaluate the training was 10 kg, whereas the experimental spacecraft platforms have a mass of 16.9 kg. These discrepancies make the experiment an excellent test of the simulation-to-reality capabilities of the proposed deep guidance technique. Because of facility size limitations, the hold point was reduced from 1.0 to 0.9 m offset from the front face of the target.

C. Results

A trajectory of the experiment with a stationary target is shown in Fig. 14a. It shows that the deep guidance policy successfully outputs velocity commands that bring the chaser to the hold point, and then additional velocity commands to move toward the target docking port. A trajectory of the experiment with a rotating target is shown in Fig. 14b, and a trajectory of the experiment with an obstacle and a rotating target is shown in Fig. 14c. The learned deep guidance technique successfully commands an appropriate velocity signal for the chaser to complete the tasks.

Fig. 14 Experimental trajectories of deep guidance in SPOT of Carleton University: a) stationary target; b) rotating target; c) rotating target with obstacle avoidance.

The deep guidance technique was successfully trained exclusively in simulation and deployed to an experimental facility, and achieved similar performance to that during training. Combining the neural-network guidance with conventional control allowed the trained system to handle unmodeled effects present in the experiment. These results demonstrate deep guidance as a viable solution to the simulation-to-reality problem present in deep reinforcement learning. A video of the simulated and experimental results can be found in supplemental video S1 or online at https://youtu.be/n7K6aC5v0aY.
VII. Conclusions

This paper introduced deep reinforcement learning to the guidance problem for spacecraft robotics. Through training a guidance policy to accomplish a goal, complex behaviors can be learned rather than handcrafted. The simulation-to-reality gap dictates that policies trained in simulation often do not transfer well to reality. To avoid this, and to avoid additional training once deployed to a physical robot, the authors restrict the policy to output a guidance signal, which a conventional controller is tasked with tracking. Conventional control can handle the modeling errors that typically plague reinforcement learning. This paper tests this learned guidance technique, which the authors call deep guidance, on a simple problem, that is, spacecraft pose tracking and docking. Numerical simulations show that proximity operation tasks can be successfully learned using the deep guidance technique. The trained policies are then deployed to three experiments, with comparable results to those in simulation, even though the simulated environment did not model all effects present in the experimental facility. Future work will further explore the generality of the technique and its use on more-difficult problems.

Acknowledgment

This research was financially supported in part by the Natural Sciences and Engineering Research Council of Canada under the Postgraduate Scholarship-Doctoral PGSD3-503919-2017 award.

References

[1] Flores-Abad, A., Ma, O., Pham, K., and Ulrich, S., "A Review of Space Robotics Technologies for On-Orbit Servicing," Progress in Aerospace Sciences, Vol. 68, July 2014, pp. 1–26. https://doi.org/10.1016/j.paerosci.2014.03.002
[2] Aghili, F., "A Prediction and Motion-Planning Scheme for Visually Guided Robotic Capturing of Free-Floating Tumbling Objects with Uncertain Dynamics," IEEE Transactions on Robotics, Vol. 28, No. 3, 2012, pp. 634–649. https://doi.org/10.1109/TRO.2011.2179581
[3] Wilde, M., Ciarcià, M., Grompone, A., and Romano, M., "Experimental Characterization of Inverse Dynamics Guidance in Docking with a Rotating Target," Journal of Guidance, Control, and Dynamics, Vol. 39, No. 6, 2016, pp. 1173–1187. https://doi.org/10.2514/1.G001631
[4] Ma, Z., Ma, O., and Shashikanth, B. N., "Optimal Approach to and Alignment with a Rotating Rigid Body for Capture," Journal of the Astronautical Sciences, Vol. 55, No. 4, 2007, pp. 407–419. https://doi.org/10.1007/BF03256532
[5] Dong, H., Hu, Q., and Akella, M. R., "Dual-Quaternion-Based Spacecraft Autonomous Rendezvous and Docking Under Six-Degree-of-Freedom Motion Constraints," Journal of Guidance, Control, and Dynamics, Vol. 41, No. 5, 2018, pp. 1150–1162. https://doi.org/10.2514/1.G003094
[6] Filipe, N., and Tsiotras, P., "Adaptive Position and Attitude Tracking Controller for Satellite Proximity Operations Using Dual Quaternions," Journal of Guidance, Control, and Dynamics, Vol. 38, No. 4, 2015, pp. 566–577. https://doi.org/10.2514/1.G000054
[7] Gui, H., and Vukovich, G., "Finite-Time Output-Feedback Position and Attitude Tracking of a Rigid Body," Automatica, Vol. 74, Dec. 2016, pp. 270–278. https://doi.org/10.1016/j.automatica.2016.08.003
[8] Pothen, A. A., and Ulrich, S., "Close-Range Rendezvous with a Moving Target Spacecraft Using Udwadia-Kalaba Equation," American Control Conference, IEEE Publ., Piscataway, NJ, 2019, pp. 3267–3272. https://doi.org/10.23919/ACC.2019.8815115
[9] Pothen, A. A., and Ulrich, S., "Pose Tracking Control for Spacecraft Proximity Operations Using the Udwadia-Kalaba Framework," AIAA Guidance, Navigation, and Control Conference, AIAA Paper 2020-1598, Jan. 2020. https://doi.org/10.2514/6.2020-1598
[10] Hough, J., and Ulrich, S., "Lyapunov Vector Fields for Thrust-Limited Spacecraft Docking with an Elliptically-Orbiting Uncooperative Tumbling Target," AIAA Guidance, Navigation, and Control Conference, AIAA Paper 2020-2078, Jan. 2020. https://doi.org/10.2514/6.2020-2078
[11] Hornik, K., Stinchcombe, M., and White, H., "Multilayer Feedforward Networks Are Universal Approximators," Neural Networks, Vol. 2, No. 5, 1989, pp. 359–366. https://doi.org/10.1016/0893-6080(89)90020-8
[12] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D., "Human-Level Control Through Deep Reinforcement Learning," Nature, Vol. 518, No. 7540, 2015, pp. 529–533. https://doi.org/10.1038/nature14236
[13] Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M., "The Arcade Learning Environment: An Evaluation Platform for General Agents," Journal of Artificial Intelligence Research, Vol. 47, June 2013, pp. 253–279. https://doi.org/10.1613/jair.3912
[14] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D., "Mastering the Game of Go with Deep Neural Networks and Tree Search," Nature, Vol. 529, No. 7587, 2016, pp. 484–489. https://doi.org/10.1038/nature16961
[15] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D., "Mastering the Game of Go Without Human Knowledge," Nature, Vol. 550, No. 7676, 2017, pp. 354–359. https://doi.org/10.1038/nature24270
[16] Kober, J., Bagnell, J. A., and Peters, J., "Reinforcement Learning in Robotics: A Survey," International Journal of Robotics Research, Vol. 32, No. 11, 2013, pp. 1238–1274. https://doi.org/10.1177/0278364913495721
[17] Tai, L., Paolo, G., and Liu, M., "Virtual-to-Real Deep Reinforcement Learning: Continuous Control of Mobile Robots for Mapless Navigation," IEEE International Conference on Intelligent Robots and Systems, IEEE Publ., Piscataway, NJ, 2017, pp. 31–36. https://doi.org/10.1109/IROS.2017.8202134
[18] Sünderhauf, N., Brock, O., Scheirer, W., Hadsell, R., Fox, D., Leitner, J., Upcroft, B., Abbeel, P., Burgard, W., Milford, M., and Corke, P., "The Limits and Potentials of Deep Learning for Robotics," International Journal of Robotics Research, Vol. 37, Nos. 4–5, 2018, pp. 405–420. https://doi.org/10.1177/0278364918770733
[19] Bousmalis, K., Irpan, A., Wohlhart, P., Bai, Y., Kelcey, M., Kalakrishnan, M., Downs, L., Ibarz, J., Pastor, P., Konolige, K., Levine, S., and Vanhoucke, V., "Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping," IEEE International Conference on Robotics and Automation, IEEE Publ., Piscataway, NJ, 2018, pp. 4243–4250. https://doi.org/10.1109/ICRA.2018.8460875
[20] Andrychowicz, O. A. M., Baker, B., Chociej, M., Józefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., and Zaremba, W., "Learning Dexterous In-Hand Manipulation," International Journal of Robotics Research, Vol. 39, No. 1, 2020, pp. 3–20. https://doi.org/10.1177/0278364919887447
[21] Loquercio, A., Kaufmann, E., Ranftl, R., Dosovitskiy, A., Koltun, V., and Scaramuzza, D., "Deep Drone Racing: From Simulation to Reality with Domain Randomization," IEEE Transactions on Robotics, Vol. 36, No. 1, 2020, pp. 1–14. https://doi.org/10.1109/TRO.2019.2942989
[22] Cutler, M., and How, J. P., "Autonomous Drifting Using Simulation-Aided Reinforcement Learning," IEEE International Conference on Robotics and Automation, IEEE Publ., Piscataway, NJ, 2016, pp. 5442–5448. https://doi.org/10.1109/ICRA.2016.7487756
[23] Hwangbo, J., Sa, I., Siegwart, R., and Hutter, M., "Control of a Quadrotor with Reinforcement Learning," IEEE Robotics and Automation Letters, Vol. 2, No. 4, 2017, pp. 2096–2103. https://doi.org/10.1109/LRA.2017.2720851
[24] Julian, K. D., and Kochenderfer, M. J., "Distributed Wildfire Surveillance with Autonomous Aircraft Using Deep Reinforcement Learning," Journal of Guidance, Control, and Dynamics, Vol. 42, No. 8, 2019, pp. 1768–1778. https://doi.org/10.2514/1.G004106
[25] Chan, D. M., and Agha-Mohammadi, A. A., "Autonomous Imaging and Mapping of Small Bodies Using Deep Reinforcement Learning," IEEE Aerospace Conference, IEEE Publ., Piscataway, NJ, 2019. https://doi.org/10.1109/AERO.2019.8742147
[26] Willis, S., Izzo, D., and Hennes, D., "Reinforcement Learning for Spacecraft Maneuvering Near Small Bodies," AAS/AIAA Space Flight Mechanics Meeting, AAS Paper 16-277, Feb. 2016, pp. 1351–1368.
[27] Lafarge, N. B., Miller, D., Howell, K. C., and Linares, R., "Guidance for Closed-Loop Transfers Using Reinforcement Learning with Application to Libration Point Orbits," AIAA Guidance, Navigation, and Control Conference, AIAA Paper 2020-0458, Jan. 2020. https://doi.org/10.2514/6.2020-0458
[28] Scorsoglio, A., Furfaro, R., Linares, R., and Gaudet, B., "Image-Based Deep Reinforcement Learning for Autonomous Lunar Landing," AIAA Guidance, Navigation, and Control Conference, AIAA Paper 2020-1910, Jan. 2020. https://doi.org/10.2514/6.2020-1910
[29] Gaudet, B., and Furfaro, R., "Adaptive Pinpoint and Fuel Efficient Mars Landing Using Reinforcement Learning," IEEE/CAA Journal of Automatica Sinica, Vol. 1, No. 4, 2014, pp. 397–411. https://doi.org/10.1109/JAS.2014.7004667
[30] Furfaro, R., Simo, J., Gaudet, B., and Wibben, D. R., "Neural-Based Trajectory Shaping Approach for Terminal Planetary Pinpoint Guidance," AAS/AIAA Astrodynamics Specialist Conference, AAS Paper 13-875, Aug. 2013.
[31] Sánchez-Sánchez, C., and Izzo, D., "Real-Time Optimal Control via Deep Neural Networks: Study on Landing Problems," Journal of Guidance, Control, and Dynamics, Vol. 41, No. 5, 2018, pp. 1122–1135. https://doi.org/10.2514/1.G002357
[32] Ulrich, S., Saenz-Otero, A., and Barkana, I., "Passivity-Based Adaptive Control of Robotic Spacecraft for Proximity Operations Under Uncertainties," Journal of Guidance, Control, and Dynamics, Vol. 39, No. 6, 2016, pp. 1444–1453. https://doi.org/10.2514/1.G001491
[33] Jafarnejadsani, H., Sun, D., Lee, H., and Hovakimyan, N., "Optimized L1 Adaptive Controller for Trajectory Tracking of an Indoor Quadrotor," Journal of Guidance, Control, and Dynamics, Vol. 40, No. 6, 2017, pp. 1415–1427. https://doi.org/10.2514/1.G000566
[34] Huang, P., Wang, D., Meng, Z., Zhang, F., and Guo, J., "Adaptive Postcapture Backstepping Control for Tumbling Tethered Space Robot-Target Combination," Journal of Guidance, Control, and Dynamics, Vol. 39, No. 1, 2016, pp. 150–156. https://doi.org/10.2514/1.G001309
[35] Harris, A., Teil, T., and Schaub, H., "Spacecraft Decision-Making Autonomy Using Deep Reinforcement Learning," AAS/AIAA Space Flight Mechanics Meeting, AAS Paper 19-447, Jan. 2019.
[36] Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Dhruva, T. B., Muldal, A., Heess, N., and Lillicrap, T., "Distributed Distributional Deterministic Policy Gradients," International Conference on Learning Representations, Vancouver, Canada, 2018.
[37] Bellemare, M. G., Dabney, W., and Munos, R., "A Distributional Perspective on Reinforcement Learning," International Conference on Machine Learning, PMLR, Sydney, Australia, 2017, pp. 449–458.
[38] Mnih, V., Badia, A., Mirza, M., Graves, A., and Lillicrap, T., "Asynchronous Methods for Deep Reinforcement Learning," International Conference on Machine Learning, PMLR, New York, 2016, pp. 1928–1937.
[39] Sutton, R. S., and Barto, A. G., Reinforcement Learning: An Introduction, 2nd ed., MIT Press, Cambridge, MA, 1998, p. 148.
[40] Zappulla, R., Park, H., Virgili-Llop, J., and Romano, M., "Real-Time Autonomous Spacecraft Proximity Maneuvers and Docking Using an Adaptive Artificial Potential Field Approach," IEEE Transactions on Control Systems Technology, Vol. 27, No. 6, 2019, pp. 2598–2605. https://doi.org/10.1109/TCST.2018.2866963
[41] Ciarcià, M., Grompone, A., and Romano, M., "A Near-Optimal Guidance for Cooperative Docking Maneuvers," Acta Astronautica, Vol. 102, Sept. 2014, pp. 367–377. https://doi.org/10.1016/j.actaastro.2014.01.002
[42] Mammarella, M., Capello, E., Park, H., Guglieri, G., and Romano, M., "Tube-Based Robust Model Predictive Control for Spacecraft Proximity Operations in the Presence of Persistent Disturbance," Aerospace Science and Technology, Vol. 77, June 2018, pp. 585–594. https://doi.org/10.1016/j.ast.2018.04.009
[43] Saulnier, K., Pérez, D., Huang, R. C., Gallardo, D., Tilton, G., and Bevilacqua, R., "A Six-Degree-of-Freedom Hardware-in-the-Loop Simulator for Small Spacecraft," Acta Astronautica, Vol. 105, No. 2, 2014, pp. 444–462. https://doi.org/10.1016/j.actaastro.2014.10.027
[44] Oliphant, T. E., "Python for Scientific Computing," Computing in Science & Engineering, Vol. 9, No. 3, 2007, pp. 10–20. https://doi.org/10.1109/MCSE.2007.58
[45] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D., "Continuous Control with Deep Reinforcement Learning," International Conference on Learning Representations, San Juan, Puerto Rico, 2016.
[46] Kingma, D. P., and Ba, J., "Adam: A Method for Stochastic Optimization," International Conference on Learning Representations, San Diego, CA, 2015.

I. I. Hussein
Associate Editor
