Rocket Landing Control With Random Annealing Jump Start Reinforcement Learning
Yuxuan Jiang², Yujie Yang², Zhiqian Lan², Guojian Zhan², Shengbo Eben Li*¹,², Qi Sun², Jian Ma³, Tianwen Yu³, Changwu Zhang³

This study is supported by Tsinghua University Initiative Scientific Research Program, and NSF China with U20A20334 and 52072213. ¹College of Artificial Intelligence, Tsinghua University, Beijing, 100084, China. ²School of Vehicle and Mobility, Tsinghua University, Beijing, 100084, China. ³LandSpace Technology Corporation, Beijing, 100176, China. *All correspondence should be sent to S. E. Li with e-mail: [email protected]
Abstract— … navigates the environment for the guide horizon, followed by the exploration policy taking charge to complete remaining steps. This jump-start strategy prunes exploration space, rendering the problem more tractable to RL algorithms. The guide horizon is sampled from a uniform distribution, with its upper bound annealing to zero based on performance metrics, mitigating distribution shift and mismatch issues in existing methods. Additional enhancements, including cascading jump start, refined reward and terminal condition, and action smoothness regulation, further improve policy performance and practical applicability. The proposed method is validated through extensive evaluation and Hardware-in-the-Loop testing, affirming the effectiveness, real-time feasibility, and smoothness of the proposed controller.

[Fig. 1. Brief task demonstration of rocket landing control]

I. INTRODUCTION
Rocket recycling is a pivotal pursuit in the field of aerospace technology, driven by its potential to significantly reduce costs and mitigate the environmental impact of space exploration. Central to this endeavor is the challenge of rocket landing control, a task that involves guiding a nonlinear underactuated rocket plant with limited fuel to land in real time. The controller must overcome large random disturbances and satisfy strict terminal state requirements at landing. Conventional control strategies, such as PID controllers, require tedious tuning to meet the accuracy prerequisites, while optimization methods face difficulties in meeting the real-time constraints of this task. Reinforcement learning (RL), employing the offline training and online implementation (OTOI) model, presents a viable solution for addressing these complex challenges.

RL offers a powerful paradigm to iteratively optimize policies for control problems by maximizing the expected cumulative rewards. It has shown remarkable success in various domains, including video games [1], board games [2], robotics [3], and autonomous driving [4]. In these problems, well-crafted dense reward signals are critical for RL agents to progressively improve their policies. However, rocket landing poses a unique challenge for RL. It is a goal-oriented problem, requiring the agent to reach a specific goal set in the state space, with intermediate reward signals not readily available. Standard RL algorithms, such as PPO [5], SAC [6] and DSAC [7], fail in such scenarios due to their reliance on extensive exploration; the likelihood of randomly reaching the goal diminishes exponentially over time.

General goal-oriented tasks without any prior knowledge present intrinsic difficulties. Several algorithm categories, such as intrinsic reward methods [8], [9] and efficient exploration methods [10], [11], have been developed to optimize policies in such contexts. However, they still face significant challenges in scenarios with hard-to-reach goals and large continuous action spaces, requiring an exponential increase in sample complexity. In many cases, some prior knowledge of the target environment does exist, and leveraging this knowledge can accelerate learning and enhance performance. Reward shaping is a common method for integrating such knowledge. Although manually designed rewards are adaptable and straightforward to implement, they necessitate substantial effort to balance multiple factors. In OpenAI Five [12], for instance, over twenty shaped reward components are carefully weighted to achieve professional-level performance. Another approach, prevalent in robotic control tasks, involves using a reference trajectory with a quadratic tracking reward. For rocket landing control, designing feasible trajectories demands expert knowledge [13], and a single fixed trajectory becomes impractical under variable initial conditions and environmental disturbances.
Alternatively, some methods utilize more specific prior knowledge, like offline datasets or prior policies, to aid exploration and learning. This class of methods typically targets initialization, either initializing the policy before RL training [14], [15] or initializing the state before each episode [16], [17]. One simple and effective approach is jump start RL (JSRL) [18], which initializes each episode by applying a guide policy over the guide horizon before transitioning to the exploration RL policy for the remainder of the task. The effectiveness of the jump start technique hinges on the design of the guide horizon, with two variants, JSRL-Curriculum and JSRL-Random, proposed in the JSRL paper.

In this paper, we build on the jump start framework and introduce the Random Annealing Jump Start (RAJS) approach. RAJS is particularly suitable for real-world control problems like rocket landing, which often involve some form of prior feedback controllers based on PID or other classical control methods. For each episode, RAJS samples the guide horizon uniformly between zero and an upper bound, and anneals the upper bound to zero during training. This minimizes state distribution shifts, addressing the limitations of JSRL-Curriculum, and broadens the applicability of RL algorithms from offline and off-policy to on-policy variants. Additionally, the final initial state distribution aligns with the underlying environment, countering the distribution mismatch issue seen in JSRL-Random. The stability afforded by RAJS simplifies the annealing schedule, allowing for either automatic scheduling based on training metric feedback or manual adjustment using a clamped ramp function. We also integrate several generally applicable practical techniques, including cascading jump start, refined reward and terminal condition design, and action smoothness regulation. These enhancements enable us to effectively tackle the rocket landing control problem, elevating the success rate from 8% with the baseline controller to an impressive 97%. Extensive evaluation reveals that the control maneuvers are both smooth and interpretable, and Hardware-in-the-Loop testing confirms the real-time feasibility of the control signals produced by our controller.
II. PRELIMINARY

A. Problem Statement

Illustrated in Fig. 1, our study focuses on the critical phase of vertical landing within the lower atmospheric layer, which presents the highest demand for control precision in rocket recycling missions. In the context of controller operations, the fundamental objective of the rocket landing task is to guide the rocket from diverse initial states to land on the ground, while satisfying various terminal state constraints. The plant is a high-fidelity rocket model built in Simulink from LandSpace, containing calibrated subsystems for inertia, engine, actuator, aerodynamics and disturbance. The coordinate system used for the rocket landing problem has its origin on the ground. The y-axis points vertically upward, the x-axis points toward the north, and the z-axis points toward the east.

The task consists of two stages. In the first stage, all three engines of the rocket are available for deceleration and pose control. The task switches to the second stage when the vertical velocity vy reaches a threshold of vsw = −60 m/s, where the internal low-level control mechanism closes two of the three engines and continues the mission until landing. This internal switch does not affect the controller-rocket interaction directly, but is reflected in the model dynamics.

The key states of the rocket plant include position x, y, z, velocity vx, vy, vz, angular position ϕ, ψ, γ, angular velocity ϕ̇, ψ̇, γ̇, and total mass m, which decreases dynamically as fuel is consumed. Each simulation run allows these state variables to initialize within specified ranges, as shown in Table I. The constraints for the terminal state are defined to ensure a precise landing within a narrow vicinity of the target, along with parameters critical for maintaining landing stability, also delineated in Table I. The state observation includes 12 kinematic variables (encompassing position, velocity, angular position, and angular velocity) and the axial load. It is important to note that certain plant states, such as those related to engine response, remain hidden from the controller. The control inputs are modeled as three engine attitude signals and one engine thrust signal, normalized within −1 to 1. The simulation terminates upon the occurrence of one of the following events: a successful or failed landing, fuel exhaustion, or vertical speed reversal.

TABLE I
MAIN STATES, INITIAL VALUES AND TERMINAL CONSTRAINTS

State  Unit   Initial value & range   Terminal value & range
x      m      50 ± 500                0 ± 5
y      m      2000 ± 10               0
z      m      0 ± 500                 0 ± 5
vx     m/s    −10 ± 50                0 ± 1
vy     m/s    −300 ± 50               −1 ∼ 0
vz     m/s    0 ± 50                  0 ± 1
ϕ      °      90 ± 0.5                90 ± 3
ψ      °      0 ± 0.5                 0 ± 3
γ      °      0 ± 0.5                 -
ϕ̇      °/s    0 ± 0.1                 0 ± 1.5
ψ̇      °/s    0 ± 0.1                 0 ± 1.5
γ̇      °/s    0 ± 0.1                 -
m      kg     50000 ± 500             ≥ 40000

Additional complexity is introduced in the form of wind disturbance, characterized by a wind speed uniformly sampled within a range of 0 m/s to 15 m/s and a wind direction sampled from any direction within the horizontal plane. Crucially, these wind parameters remain unobservable to the controller, necessitating a control algorithm that possesses robustness to such external perturbations.

A baseline controller is integrated within the plant for establishing a closed-loop simulation environment. This controller, based on several reference trajectories and PID control mechanisms, demonstrates the capability to maintain pose stability and manage the vertical descent under normal conditions. However, its limited adaptability in the face of disturbances results in frequent constraint violations and a modest overall success rate of 8%. Hence, although the baseline controller offers a valuable reference for policy learning, the algorithm must refrain from mere imitation and instead employ it strategically.
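To make the terminal requirements concrete, the following is a minimal Python sketch of a success check against the bounds of Table I, evaluated at touchdown (y = 0); the function name and the dictionary structure are illustrative and not part of the plant's actual interface.

```python
# Illustrative success check against the terminal constraints of Table I
# (names and data layout are ours, not the plant's API).
TERMINAL_BOUNDS = {
    "x": (-5.0, 5.0),                 # m
    "z": (-5.0, 5.0),                 # m
    "vx": (-1.0, 1.0),                # m/s
    "vy": (-1.0, 0.0),                # m/s
    "vz": (-1.0, 1.0),                # m/s
    "phi": (87.0, 93.0),              # deg (90 ± 3)
    "psi": (-3.0, 3.0),               # deg
    "dphi": (-1.5, 1.5),              # deg/s
    "dpsi": (-1.5, 1.5),              # deg/s
    "mass": (40000.0, float("inf")),  # kg (remaining-mass constraint)
}

def landing_successful(terminal_state: dict) -> bool:
    """True if every constrained state at touchdown lies inside its Table I bound."""
    return all(lo <= terminal_state[key] <= hi
               for key, (lo, hi) in TERMINAL_BOUNDS.items())
```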
B. Goal-oriented Markov decision process

Reinforcement Learning (RL) is based on the concept of a Markov Decision Process (MDP), where the optimal action solely depends on the current state [19]. To elaborate further, at each time step denoted as t, the agent selects an action at ∈ A in response to the current state st ∈ S and the policy π : S → A. Subsequently, the environment transitions to the next state according to its dynamics, represented as st+1 = f(st, at), and provides a scalar reward signal denoted as rt. The value function v^π : S → R is defined as the expected cumulative sum of rewards generated by following policy π when initiated from an initial state s.

In the context of rocket landing control, we can abstractly model it as a goal-oriented MDP. The agent is required to reach the goal set Sgoal from the initial set Sinit while maximizing the discounted return defined as:

    max_π E_{τ∼π} [ Σ_{t=0}^{T} γ^t r(st) ],

where γ is the discount factor, and the goal-oriented reward function is expressed as:

    r(st) = 1 if st ∈ Sgoal,  and  r(st) = 0 if st ∉ Sgoal.

An episode terminates when state st enters Sterm = Sgoal ∪ Sfail, where Sfail represents a collection of states disjoint from Sgoal in which continuation of the simulation is infeasible.
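As an illustration of this sparse formulation, a single environment step could assign reward and termination as sketched below; `in_goal_set` and `in_fail_set` are placeholder predicates standing in for membership tests of Sgoal and Sfail.

```python
from typing import Callable, Tuple

def goal_oriented_outcome(
    state,
    in_goal_set: Callable[[object], bool],  # placeholder test for s in S_goal
    in_fail_set: Callable[[object], bool],  # placeholder test for s in S_fail
) -> Tuple[float, bool]:
    """Return (reward, terminated) for the goal-oriented reward defined above."""
    if in_goal_set(state):
        return 1.0, True    # reward 1 only inside the goal set
    if in_fail_set(state):
        return 0.0, True    # infeasible continuation: terminate with zero reward
    return 0.0, False       # intermediate steps carry zero reward
```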
C. Proximal policy optimization

Proximal policy optimization (PPO) [5] is an on-policy RL algorithm rooted in the policy gradient method, featuring straightforward implementation, parallel sampling, and stability during training. PPO finds extensive application in addressing complex challenges, such as high-dimensional continuous control tasks, the development of advanced AI for professional-level gaming, and the recent advancements in reinforcement learning from human feedback for refining large language models. The core of PPO lies in its objective function, expressed as:

    Jπ(θ) = (1/|D|) Σ_{x,u∈D} min( ρ(θ) A^πold(x, u), clip(ρ(θ), 1 − ϵ, 1 + ϵ) A^πold(x, u) ).   (1)

Here, θ denotes the parameter of the policy network, ρ(θ) = πθ(u|x) / πold(u|x) represents the importance sampling factor, and A^πold is the advantage function computed using generalized advantage estimation [20]. PPO also incorporates the learning of a state value network to estimate v^π, which serves as a baseline in generalized advantage estimation, contributing to variance reduction. While PPO excels in exploration capabilities, it encounters significant challenges when addressing goal-oriented environments independently, primarily due to the exponential sample complexity inherent in such scenarios. To overcome this obstacle effectively, it is necessary to combine PPO with our proposed RAJS method, thus facilitating the efficient attainment of satisfactory solutions.
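For reference, Eq. (1) can be written as a short NumPy routine over a sampled batch; this is a bare sketch of the clipped surrogate only, omitting the value network, advantage estimation, entropy regularization, and the KL-based early stopping used in practice.

```python
import numpy as np

def ppo_clip_objective(logp_new: np.ndarray,
                       logp_old: np.ndarray,
                       advantages: np.ndarray,
                       clip_eps: float = 0.2) -> float:
    """Sample estimate of the clipped surrogate objective in Eq. (1)."""
    ratio = np.exp(logp_new - logp_old)                       # importance sampling factor rho(theta)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```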
III. METHOD

A. Jump start framework

The jump start framework comprises a fixed guide policy πg(a|s) and a learnable exploration policy πe(a|s). In each episode, the guide policy first navigates the environment for the guide horizon H steps, after which the exploration policy πe takes over to complete the remaining steps. Selecting H close to the full task horizon reduces the exploration space of πe, enhancing the likelihood of reaching the goal.

The effectiveness of the jump start framework hinges on the design of the guide horizon H. In essence, jump start modifies the initial state distribution dinit of the original MDP to an easier d̃init produced by the guide policy. d̃init depends solely on the guide horizon H, which can be a constant or a random variable. The gap between d̃init during training and dinit during practical evaluation can negatively impact policy performance, necessitating careful design of H. The distribution mismatch coefficient (Definition 1) is a useful metric to quantify such a gap.

Definition 1 (Distribution mismatch coefficient [21]). The distribution mismatch coefficient

    D∞ = ‖ d^π_ρ / µ ‖∞

quantifies the effect of policy gradient learning under an initial state distribution µ possibly deviated from the initial state distribution of interest ρ. Here d^π_ρ refers to the discounted stationary state distribution under policy π:

    d^π_ρ = E_{s0∼ρ} [ (1 − γ) Σ_{t=0}^{∞} γ^t Pr^π(st = s | s0) ].

The JSRL paper introduces two variants: JSRL-Curriculum and JSRL-Random. JSRL-Curriculum dynamically adjusts H as a function of training progress using a pre-designed curriculum. In practical terms, a step function is employed:

    Hk = (1 − k/n) H̄,  k = 0, 1, . . . , n,

where H̄ is the initial guide horizon, n is the number of curriculum stages, and k is incremented based on the moving average of evaluated performance. The modified initial state distribution d̃init,k associated with Hk ensures that d̃init,n is equivalent to dinit, benefiting evaluation performance. However, the transition between stages introduces a notable flaw of distribution shift, adding unseen states to the state space and changing the initial state distribution at each stage switch.

The JSRL paper focuses on transitioning from offline RL to online RL, employing implicit Q-learning (IQL) [22] as the optimization algorithm in both stages. While IQL mitigates the distribution shift to some extent through its replay buffer, on-policy algorithms like PPO face significant challenges in policy updates due to this shift. Particularly in tasks such as rocket landing control, where the initial state distributions d̃init for different guide horizons H can be completely disjoint, this shift would disrupt policy learning, rendering the current policy unlearned.
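To make the role of D∞ concrete, the toy computation below evaluates the finite-state analogue of ‖d^π_ρ/µ‖∞ for two made-up training distributions; the numbers are illustrative and unrelated to the rocket task.

```python
import numpy as np

def distribution_mismatch(d_rho: np.ndarray, mu: np.ndarray) -> float:
    """Finite-state analogue of D_inf = || d^pi_rho / mu ||_inf."""
    return float(np.max(d_rho / mu))

d_rho = np.array([0.25, 0.25, 0.25, 0.25])      # visitation under the distribution of interest (toy)
mu_broad = np.array([0.20, 0.30, 0.30, 0.20])   # training distribution with broad support
mu_narrow = np.array([0.02, 0.02, 0.48, 0.48])  # training distribution concentrated on few states

print(distribution_mismatch(d_rho, mu_broad))   # 1.25 -> benign mismatch
print(distribution_mismatch(d_rho, mu_narrow))  # 12.5 -> slow policy-gradient convergence
```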
On the other hand, JSRL-Random samples the guide horizon H from a uniform distribution U(0, H̄), where H̄ is the horizon upper bound. Unlike JSRL-Curriculum, the initial state distribution induced by the random variable H remains fixed throughout training, essentially creating an alternative stationary MDP with an adjusted initial state distribution. This approach enhances stability during training compared to JSRL-Curriculum but introduces a different challenge of distribution mismatch. Although it is well established that the optimal policy π* for an MDP is independent of the specific dinit under mild conditions, different dinit do affect the iteration complexity, particularly when function approximation is involved. As proven by [21], the suboptimality V*(s) − V^π(s) of policy gradient after T iterations is positively correlated with D∞ ϵapprox. Here, ϵapprox is the approximation error, representing the minimal possible error for the given parametric function to fit the target distribution. This correlation implies that, for a fixed positive ϵapprox of a neural network function approximation, a larger distribution mismatch coefficient D∞ contributes to slower convergence. In the context of solving goal-oriented problems with jump start, ρ corresponds to dinit, and µ corresponds to d̃init of the chosen jump start schedule. Within the support of dinit, JSRL-Random's d̃init is much smaller than dinit, leading to a larger D∞. This explains the experimental observation that JSRL-Random's policy performance on the evaluation distribution is significantly inferior to optimality.

B. Random annealing jump start

Building upon the insights gained from the analysis presented earlier, and considering the effectiveness of the jump start framework in challenging goal-oriented environments when used with on-policy algorithms, we introduce Random Annealing Jump Start (RAJS). This approach addresses the limitations of distribution shift and distribution mismatch observed in prior works. RAJS achieves this by sampling the guide horizon from a uniform distribution, similar to JSRL-Random. However, a key distinction lies in initializing the upper bound of the uniform distribution with a large value and gradually annealing it to 0 during training, as expressed by the equation:

    H ∼ U(0, H̄ β(·)),   (2)

where β(·) denotes an annealing factor transitioning from 1 to 0. RAJS effectively mitigates distribution mismatch, as the exploration policy directly engages with the underlying goal-oriented environment after the annealing of H̄ to 0. Furthermore, RAJS significantly reduces distribution shift compared to JSRL-Curriculum. In the context of D∞, assuming µ and ρ represent d̃init before and after a distribution shift, RAJS exhibits a substantial overlap between µ and ρ, resulting in a much smaller D∞ in comparison to JSRL-Random.

Due to the minimal distribution shift during training, the tuning of β(·) can be simplified. We propose a schedule based on the moving average of training metrics. For goal-oriented tasks, the proportion of episodes terminating in the goal set (or success rate), denoted as Pgoal, serves as a suitable metric. The update rule for β(Pgoal) at the end of each training iteration is as follows:

    β ← max(β − α I(Pgoal ≥ Pthresh), 0),   (3)

where Pthresh is a tunable performance threshold, I(·) is the indicator function, and α is the update step size. Additionally, due to the improved training stability, designing a ramp schedule β(N) manually becomes trivial, with N representing the total number of environment interactions. The start and end steps of the ramp can be determined by solving the task once with β ≡ 1 and observing the success rate training curve.

RAJS relaxes JSRL's guide policy requirements, extending its applicability to on-policy RL algorithms, as outlined in Algorithm 1. In this paper, our focus is on PPO, discussed in the preliminary section, leveraging its general applicability to address the complex challenge of rocket landing control.

Algorithm 1 Random annealing jump start w/ on-policy RL
 1: Input: guide policy πg, maximum guide horizon H̄, metric threshold Pthresh, annealing step size α, training batch size B.
 2: Initialize exploration policy πe and other required function approximation, e.g., state value function V. Initialize annealing factor β ← 1 and moving mean metric P ← 0.
 3: procedure ROLLOUT(πg, πe, H̄, D, P)
 4:     Sample initial state s0 ∼ dinit, guide horizon H ∼ U(0, H̄).
 5:     Rollout ⌊H⌋ steps with guide policy πg.
 6:     Rollout until termination with exploration policy πe, logging trajectory {(s, a, r), . . .} to D.
 7:     Update P with metric of current episode.
 8: end procedure
 9: repeat
10:     Initialize trajectory dataset D = {}.
11:     Sample with ROLLOUT(πg, πe, H̄β, D, P), until |D| ≥ B.
12:     πe, V ← TRAINPOLICY(πe, V, D).
13:     β ← max(β − αI(P ≥ Pthresh), 0).
14: until β = 0 and convergence
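A minimal Python sketch of Algorithm 1's rollout and annealing steps is given below, assuming a Gymnasium-style `env` and callable `guide_policy` / `explore_policy`; the function names are ours, and the default Pthresh and α values (taken from Table II) are illustrative rather than part of an existing library API.

```python
import random

def rajs_rollout(env, guide_policy, explore_policy, h_max: float, beta: float):
    """One episode of random annealing jump start (Algorithm 1, lines 3-8)."""
    obs, _ = env.reset()
    guide_steps = int(random.uniform(0.0, h_max * beta))  # H ~ U(0, H_bar * beta), floored
    trajectory, done, t = [], False, 0
    while not done:
        policy = guide_policy if t < guide_steps else explore_policy
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        if t >= guide_steps:               # only exploration-policy steps are logged for training
            trajectory.append((obs, action, reward))
        obs, t = next_obs, t + 1
    return trajectory

def anneal(beta: float, success_rate: float,
           p_thresh: float = 0.3, alpha: float = 1.0 / 1500) -> float:
    """Annealing update of Eq. (3): shrink beta once the success-rate metric clears the threshold."""
    return max(beta - alpha * float(success_rate >= p_thresh), 0.0)
```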
C. Practical techniques

Several practical techniques are introduced to further enhance the resolution of goal-oriented tasks. While these techniques were initially developed in the context of rocket landing control, their underlying principles are broadly applicable to a diverse range of tasks.

1) Cascading jump start: In rocket landing control, the baseline controller, which serves as the guide policy, fails in 92% of cases. Consequently, the exploration policy may face a dilemma of hard exploration or unrecoverable failure, depending on whether the guide horizon is shorter or longer. Although RAJS significantly reduces the exploration space for RL agents, the initial training phase remains challenging to explore. With respect to the PPO agent, it may suffer from premature entropy collapse, necessitating careful selection of the entropy regulation coefficient to maintain agent exploration. This complicates further performance tuning, as additional difficulty imposed on the agent (e.g., an action smoothness requirement) can impede policy convergence.
In this scenario, an effective technique involves incorporating cascading jump start, where an agent is trained under the vanilla setting and subsequently used as the new guide policy πg′ in experiments with more demanding settings. Experiments demonstrate that this technique facilitates simplified exploration at the outset of training, with no significant impact on the final performance after convergence.
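The cascading idea reduces to a two-stage training loop, sketched below with hypothetical helpers (`make_env`, `train_rajs`) and a `baseline_controller` guide; stage two simply promotes the stage-one agent to the new guide policy πg′ while the harder setting (e.g., action smoothing) is switched on.

```python
# Hypothetical two-stage cascading jump start; all helper names are illustrative.
def cascading_jump_start(make_env, train_rajs, baseline_controller):
    # Stage 1: vanilla setting, baseline controller as the guide policy.
    stage1_policy = train_rajs(make_env(action_smoothing=False),
                               guide_policy=baseline_controller)
    # Stage 2: more demanding setting, stage-1 agent as the new guide policy pi_g'.
    stage2_policy = train_rajs(make_env(action_smoothing=True),
                               guide_policy=stage1_policy)
    return stage2_policy
```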
2) Reward design: The technical relevance of the outcome of a goal-oriented problem lies solely in the goal set, specifically the satisfaction of terminal state constraints associated with the set. As formulated in the preliminary section, the precisely equivalent terminal reward signal would be a binary signal indicating constraint satisfaction. However, the absence of a smooth reward gradient causes the policy's exploration to be purely driven by "luck", leading to a deterioration in training efficiency and performance.

In an effort to relax the reward, several rules must be followed to ensure that the new objective stays close to the original definition:
1) All intermediate steps should have zero reward. Non-zero intermediate rewards are prone to unexpected policy exploitation behavior and tend to significantly deviate the task from the original definition.
2) Terminal reward should be non-negative. Easily obtained negative rewards would drive the policy to extend intermediate steps and avoid reaching terminal states due to γ-discounting.
Adhering to these requirements, we can provide smooth rewards in Sprox, where Sgoal ⊂ Sprox ⊆ Sterm, to guide trajectories that terminate in the proximity of the goal set in the correct optimization direction. In rocket landing control, Sprox can be chosen as all landing states {s | y = 0}, while other conditions of failure, including fuel exhaustion and vertical speed reversal, still receive zero terminal reward. In Sprox, we propose a logarithmic function:

    rprox = max(b − log(1 + p max e), 0),   (4)

where p and b are the scale and bias terms controlling the effective range of terminal states that receive a positive reward. Each element of e corresponds to the normalized absolute terminal error of a constrained state:

    e = |sT − starget| / srange.

Compared to the typical quadratic reward form, this ensures a non-zero gradient even if the terminal state is far from the target, as well as a larger gradient close to the target to facilitate constraint satisfaction among noisy samples.
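Eq. (4) translates directly into a few lines of Python; the scale p and bias b defaults below are arbitrary placeholders, and the target and range vectors would be taken from the terminal constraints in Table I.

```python
import numpy as np

def proximity_reward(s_terminal: np.ndarray,
                     s_target: np.ndarray,
                     s_range: np.ndarray,
                     p: float = 10.0,
                     b: float = 3.0) -> float:
    """Logarithmic terminal reward of Eq. (4), applied only on the relaxed set S_prox."""
    e = np.abs(s_terminal - s_target) / s_range          # normalized absolute terminal errors
    return float(max(b - np.log(1.0 + p * np.max(e)), 0.0))
```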
3) Terminal condition: In addition to the primary terminal condition, the application of early termination based on heuristics proves beneficial for credit assignment in the absence of intermediate rewards, leading to an acceleration in policy learning. In the rocket landing task, learning vertical speed control from scratch is time-consuming due to a long control horizon, difficulties in credit assignment, the coupling of y and vy, and the tight bound on vy when y reaches 0. To address this, kinematics rules are employed to provide coarse information about feasibility. An additional early termination condition is derived for situations where the task is inevitably bound to fail:

    ymin = (vy² − vsw²) / (2 amax,1) + vsw² / (2 amax,2),   if vy ≤ vsw,
    ymin = vy² / (2 amax,2),                                if vsw < vy < 0,   (5)

where amax,1 and amax,2 represent the approximate values of maximum deceleration in the two control stages. The episode is terminated with zero terminal reward once y ≤ ymin, signifying the impossibility of a proper landing even with maximum deceleration.
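The early-termination rule of Eq. (5) amounts to a short feasibility check; amax,1 and amax,2 are stage-wise deceleration limits whose default values below are rough placeholders, not calibrated plant data.

```python
def min_feasible_altitude(vy: float, v_sw: float = -60.0,
                          a_max1: float = 20.0, a_max2: float = 15.0) -> float:
    """y_min of Eq. (5): the lowest altitude from which the rocket can still brake to vy = 0."""
    if vy <= v_sw:   # brake to v_sw in stage 1, then to zero in stage 2
        return (vy**2 - v_sw**2) / (2.0 * a_max1) + v_sw**2 / (2.0 * a_max2)
    return vy**2 / (2.0 * a_max2)   # already below the switching speed (v_sw < vy < 0)

def should_terminate_early(y: float, vy: float) -> bool:
    """Terminate with zero terminal reward once the remaining altitude cannot absorb the speed."""
    return vy < 0.0 and y <= min_feasible_altitude(vy)
```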
4) Action smoothing: In dealing with high-fidelity plants, it is common for them to account for actuator delay to a certain extent. However, modeling the response perfectly, especially the transient response under oscillating control signals, is impractical. Therefore, for practical actuators, achieving a smooth operating curve is desirable to minimize the disparity between simulation and reality. In this context, RL policies, supported by powerful neural networks, are typically less effective than classical control methods, which is attributed to the high nonlinearity and the absence of intrinsic motivation for smoothness. The challenge becomes more pronounced when addressing goal-oriented tasks, where the requirement of zero intermediate reward hinders direct smoothness regulation.

To address these challenges, two measures are implemented to improve action smoothness in goal-oriented tasks. Firstly, we redefine the action ã as the actuator increment, incorporating the original action a into the state observation s̃:

    s̃ = [s, a],
    ã = ∆a,                                  (6)
    a′ = clip(a + k∆a, −1, 1),

where k is a scaling factor. Secondly, instead of relying on a regulatory reward, we intervene in the learning process at the loss level. This is achieved by adding a term to PPO's policy loss (1) alongside the advantage:

    J̃π(θ) = Jπ(θ) + ϵ Σ ∥ã∥²,   (7)

where ϵ is a small positive coefficient. While these two measures significantly reduce oscillation, they also introduce increased learning difficulty. Through cascading jump start, these measures are only applied in the second training stage with a strong guide policy, thereby alleviating the associated difficulties.
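A compact sketch of the two measures follows: the incremental-action transformation of Eq. (6), with the previous action appended to the observation, and the loss-level penalty of Eq. (7); the scaling factor k and coefficient ϵ values are illustrative.

```python
import numpy as np

def incremental_action(prev_action: np.ndarray, delta: np.ndarray, k: float = 0.1) -> np.ndarray:
    """Eq. (6): the policy outputs an increment; the actuator command is the clipped accumulation."""
    return np.clip(prev_action + k * delta, -1.0, 1.0)

def augment_observation(obs: np.ndarray, prev_action: np.ndarray) -> np.ndarray:
    """Eq. (6): expose the current actuator command to the policy, s_tilde = [s, a]."""
    return np.concatenate([obs, prev_action])

def smoothed_policy_objective(j_ppo: float, increments: np.ndarray, eps: float = 1e-3) -> float:
    """Eq. (7): PPO policy loss augmented with a squared-increment smoothness term."""
    return j_ppo + eps * float(np.sum(increments ** 2))
```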
IV. EXPERIMENT

A. Environment configuration

As detailed in the problem statement, the high-fidelity rocket plant and its baseline controller were modeled using Simulink. To enable interaction with the RL policy, we wrapped the system to comply with the standard RL environment interface, as is shown in Fig. 2. RL training demands a significant number of samples for convergence. However, Simulink's
interpreted execution is not optimal for efficient and parallel environment interaction. To address this, we utilized Simulink Embedded Coder to generate C code, compiling it into an efficient native module. The use of GOPS Slxpy [23] facilitated automated glue code generation, producing a cross-platform binary with deterministic simulation and flexible parameter tunability. Notably, the control signal supplied to the plant can dynamically transition between external action and the baseline controller through a parameter, streamlining integration with RAJS.

[Fig. 2. Wrapped plant for RL training — blocks: Plant, Postprocess, PPO-RAJS, Guidance and control system; signals: external action, baseline action, state, obs, reward, done, info]
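The wrapped plant can be pictured as a thin Gym-style facade over the compiled module; the `plant` interface and the `use_external_action` switch below are hypothetical stand-ins for the generated binary and its tunable parameter.

```python
class WrappedRocketEnv:
    """Illustrative Gym-style wrapper over the compiled plant (interface names are assumed)."""

    def __init__(self, plant, use_external_action: bool = True):
        self.plant = plant
        # When False, the plant consumes its internal guidance-and-control (baseline) action instead.
        self.use_external_action = use_external_action

    def reset(self):
        state = self.plant.reset()
        return self._observe(state), {}

    def step(self, action):
        state = self.plant.step(action if self.use_external_action else None)
        obs = self._observe(state)
        reward, terminated = self._postprocess(state)
        return obs, reward, terminated, False, {}

    def _observe(self, state):
        # Placeholder: select the 12 kinematic variables plus the axial load from the plant state.
        return state

    def _postprocess(self, state):
        # Placeholder: terminal reward (Eq. (4)) and termination logic (including Eq. (5)).
        return 0.0, False
```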
TABLE II
HYPERPARAMETERS

Algorithm   Parameter                       Value
Shared      Learning rate                   3 × 10⁻⁴
            Network size                    (256, 256)
            Network activation              tanh
            Discount factor γ               0.995
            GAE λ                           0.97
            Train batch size                20 000
            Gradient steps                  30
            Clip parameter ϵ                0.2
            Target KL divergence            0.01
            Entropy coefficient             0.007
PPO-RAJS    Maximum guide horizon H̄         18
            Success rate threshold Pthresh  0.3
            Annealing step size α           1/1500

[Figure: success rate training curves — PPO, PPO-Track, PPO-JSRL, PPO-RAJS, baseline controller; x-axis: Environment Step (×1e8), y-axis: Success rate]
In terms of reward formulation, all experiments, except […]
C. Evaluation result

We generate 10⁶ distinct initial conditions to comprehensively evaluate the final performance of the PPO-RAJS-S policy across a vast initial state space. The statistical analysis presented in Table III aligns closely with the training curves. Notably, the component vy exhibits the highest frequency of violations among all constraints, corroborating the earlier discussion on the challenge posed by controlling vy effectively. Figure 5 illustrates that the majority of trajectories either achieve success or fail proximal to the goal set boundary. However, a small subset experiences pose instability due to aggressive pose control, leading to landing significantly distant from the target, highlighting a current limitation of our approach.

The enhanced smoothness of actions is depicted in Figure 6. While both PPO-RAJS and PPO-RAJS-S successfully reach the goal set, the former displays notable fluctuations, which could potentially challenge practical actuator implementations. Conversely, actions produced by PPO-RAJS-S exhibit considerably smoother behavior.

Furthermore, we deploy the model on the ZU9E embedded platform and perform hardware-in-the-loop co-simulation with a rocket dynamics simulation engine. Results from the simulation demonstrate that policy inference fits within the control interval of 10 ms, validating its real-time applicability.

V. CONCLUSIONS

This paper presents the random annealing jump start approach, which utilizes baseline controllers to empower RL algorithms in tackling complex real-world goal-oriented tasks. Given the safety-critical nature of rocket landing control, our future research will delve into integrating safe RL theory, such as neural barrier certificates [24], to manage state constraints more effectively. This integration holds promise in addressing …
[Fig. 6 panels: Thrust, Attitude 1, Attitude 3 — action curves of PPO-RAJS vs. PPO-RAJS-S]

[5] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[6] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018, pp. 1861–1870.
[8] …, "…supervised auxiliary tasks," in International Conference on Learning Representations, 2017.
[9] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, "Curiosity-driven exploration by self-supervised prediction," in International Conference on Machine Learning. PMLR, 2017, pp. 2778–2787.
[10] A. Zanette, A. Lazaric, M. Kochenderfer, and E. Brunskill, "Learning […]