Rocket Landing Control With Random Annealing Jump Start Reinforcement Learning
Yuxuan Jiang², Yujie Yang², Zhiqian Lan², Guojian Zhan², Shengbo Eben Li*¹,², Qi Sun², Jian Ma³, Tianwen Yu³, Changwu Zhang³

This study is supported by Tsinghua University Initiative Scientific Research Program, and NSF China with U20A20334 and 52072213. ¹College of Artificial Intelligence, Tsinghua University, Beijing, 100084, China. ²School of Vehicle and Mobility, Tsinghua University, Beijing, 100084, China. ³LandSpace Technology Corporation, Beijing, 100176, China. *All correspondence should be sent to S. E. Li with e-mail: [email protected]
Abstract— … navigates the environment for the guide horizon, followed by the exploration policy taking charge to complete remaining steps. This jump-start strategy prunes exploration space, rendering the problem more tractable to RL algorithms. The guide horizon is sampled from a uniform distribution, with its upper bound annealing to zero based on performance metrics, mitigating distribution shift and mismatch issues in existing methods. Additional enhancements, including cascading jump start, refined reward and terminal condition, and action smoothness regulation, further improve policy performance and practical applicability. The proposed method is validated through extensive evaluation and Hardware-in-the-Loop testing, affirming the effectiveness, real-time feasibility, and smoothness of the proposed controller.

[Fig. 1. Brief task demonstration of rocket landing control]

I. INTRODUCTION
Rocket recycling is a pivotal pursuit in the field of aerospace technology, driven by its potential to significantly reduce costs and mitigate the environmental impact of space exploration. Central to this endeavor is the challenge of rocket landing control, a task that involves guiding a nonlinear underactuated rocket plant with limited fuel to land in real time. The controller must overcome large random disturbances and satisfy strict terminal state requirements at landing. Conventional control strategies, such as PID controllers, require tedious tuning to meet the accuracy prerequisites, while optimization methods face difficulties in meeting the real-time constraints of this task. Reinforcement learning (RL), employing the offline training and online implementation (OTOI) model, presents a viable solution for addressing these complex challenges.

RL offers a powerful paradigm to iteratively optimize policies for control problems by maximizing the expected cumulative rewards. It has shown remarkable success in various domains, including video games [1], board games [2], robotics [3], and autonomous driving [4]. In these problems, well-crafted dense reward signals are critical for RL agents to progressively improve their policies. However, rocket landing poses a unique challenge for RL. It is a goal-oriented problem, requiring the agent to reach a specific goal set in the state space, with intermediate reward signals not readily available. Standard RL algorithms, such as PPO [5], SAC [6] and DSAC [7], fail in such scenarios due to their reliance on extensive exploration; the likelihood of randomly reaching the goal diminishes exponentially over time.

General goal-oriented tasks without any prior knowledge present intrinsic difficulties. Several algorithm categories, such as intrinsic reward methods [8], [9] and efficient exploration methods [10], [11], have been developed to optimize policies in such contexts. However, they still face significant challenges in scenarios with hard-to-reach goals and large continuous action spaces, requiring an exponential increase in sample complexity. In many cases, some prior knowledge of the target environment does exist, and leveraging this knowledge can accelerate learning and enhance performance. Reward shaping is a common method for integrating such knowledge. Although manually designed rewards are adaptable and straightforward to implement, they necessitate substantial effort to balance multiple factors. In OpenAI Five [12], for instance, over twenty shaped reward components are carefully weighted to achieve professional-level performance. Another approach, prevalent in robotic control tasks, involves using a reference trajectory with a quadratic tracking reward. For rocket landing control, designing feasible trajectories demands expert knowledge [13], and a single fixed trajectory becomes impractical under variable initial conditions and environmental disturbances.
Alternatively, some methods utilize more specific prior knowledge, like offline datasets or prior policies, to aid exploration and learning. This class of methods typically targets initialization, either initializing the policy before RL training [14], [15] or initializing the state before each episode [16], [17]. One simple and effective approach is jump start RL (JSRL) [18], which initializes each episode by applying a guide policy over the guide horizon before transitioning to the exploration RL policy for the remainder of the task. The effectiveness of the jump start technique hinges on the design of the guide horizon, with two variants, JSRL-Curriculum and JSRL-Random, proposed in the JSRL paper.

In this paper, we build on the jump start framework and introduce the Random Annealing Jump Start (RAJS) approach. RAJS is particularly suitable for real-world control problems like rocket landing, which often involve some form of prior feedback controllers based on PID or other classical control methods. For each episode, RAJS samples the guide horizon uniformly between zero and an upper bound, and anneals the upper bound to zero during training. This minimizes state distribution shifts, addressing the limitations of JSRL-Curriculum, and broadens the applicability of RL algorithms from offline and off-policy to on-policy variants. Additionally, the final initial state distribution aligns with the underlying environment, countering the distribution mismatch issue seen in JSRL-Random. The stability afforded by RAJS simplifies the annealing schedule, allowing for either automatic scheduling based on training metric feedback or manual adjustment using a clamped ramp function. We also integrate several generally applicable practical techniques, including cascading jump start, refined reward and terminal condition design, and action smoothness regulation. These enhancements enable us to effectively tackle the rocket landing control problem, elevating the success rate from 8% with the baseline controller to an impressive 97%. Extensive evaluation reveals that the control maneuvers are both smooth and interpretable, and Hardware-in-the-Loop testing confirms the real-time feasibility of the control signals produced by our controller.
II. PRELIMINARY

A. Problem Statement

Illustrated in Fig. 1, our study focuses on the critical phase of vertical landing within the lower atmospheric layer, which presents the highest demand for control precision in rocket recycling missions. In the context of controller operations, the fundamental objective of the rocket landing task is to guide the rocket from diverse initial states to land on the ground, while satisfying various terminal state constraints. The plant is a high-fidelity rocket model built in Simulink from LandSpace, containing calibrated subsystems for inertia, engine, actuator, aerodynamics and disturbance. The coordinate system used for the rocket landing problem has its origin on the ground. The y-axis points vertically upward, the x-axis points toward the north, and the z-axis points toward the east.

The task consists of two stages. In the first stage, all three engines of the rocket are available for deceleration and pose control. The task switches to the second stage when the vertical velocity vy reaches a threshold of vsw = −60 m/s, where the internal low-level control mechanism closes two of the three engines and continues the mission until landing. This internal switch does not affect the controller-rocket interaction directly, but is reflected in the model dynamics.

The key states of the rocket plant include position x, y, z, velocity vx, vy, vz, angular position ϕ, ψ, γ, angular velocity ϕ̇, ψ̇, γ̇, and total mass m, which decreases dynamically as fuel is consumed. Each simulation run allows these state variables to initialize within specified ranges, as shown in Table I. The constraints for the terminal state are defined to ensure a precise landing within a narrow vicinity of the target, along with parameters critical for maintaining landing stability, also delineated in Table I. The state observation includes 12 kinematic variables (encompassing position, velocity, angular position, and angular velocity) and the axial load. It is important to note that certain plant states, such as those related to engine response, remain hidden from the controller. The control inputs are modeled as three engine attitude signals and one engine thrust signal, normalized within −1 to 1. The simulation terminates upon the occurrence of one of the following events: a successful or failed landing, fuel exhaustion, or vertical speed reversal.

TABLE I
MAIN STATES, INITIAL VALUES AND TERMINAL CONSTRAINTS

State  Unit   Initial value & range   Terminal value & range
x      m      50 ± 500                0 ± 5
y      m      2000 ± 10               0
z      m      0 ± 500                 0 ± 5
vx     m/s    −10 ± 50                0 ± 1
vy     m/s    −300 ± 50               −1 ∼ 0
vz     m/s    0 ± 50                  0 ± 1
ϕ      °      90 ± 0.5                90 ± 3
ψ      °      0 ± 0.5                 0 ± 3
γ      °      0 ± 0.5                 -
ϕ̇      °/s    0 ± 0.1                 0 ± 1.5
ψ̇      °/s    0 ± 0.1                 0 ± 1.5
γ̇      °/s    0 ± 0.1                 -
m      kg     50000 ± 500             ≥ 40000

Additional complexity is introduced in the form of wind disturbance, characterized by a wind speed uniformly sampled within a range of 0 m/s to 15 m/s and a wind direction sampled from any direction within the horizontal plane. Crucially, these wind parameters remain unobservable to the controller, necessitating a control algorithm that possesses robustness to such external perturbations.

A baseline controller is integrated within the plant for establishing a closed-loop simulation environment. This controller, based on several reference trajectories and PID control mechanisms, demonstrates the capability to maintain pose stability and manage the vertical descent under normal conditions. However, its limited adaptability in the face of disturbances results in frequent constraint violations and a modest overall success rate of 8%. Hence, although the baseline controller offers a valuable reference for policy learning, the algorithm must refrain from mere imitation and instead employ it strategically.
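To make the terminal requirements concrete, the following is a minimal Python sketch of a success check against the bounds of Table I, evaluated at touchdown (y = 0); the function name and the dictionary structure are illustrative and not part of the plant's actual interface.

```python
# Illustrative success check against the terminal constraints of Table I
# (names and data layout are ours, not the plant's API).
TERMINAL_BOUNDS = {
    "x": (-5.0, 5.0),                 # m
    "z": (-5.0, 5.0),                 # m
    "vx": (-1.0, 1.0),                # m/s
    "vy": (-1.0, 0.0),                # m/s
    "vz": (-1.0, 1.0),                # m/s
    "phi": (87.0, 93.0),              # deg (90 ± 3)
    "psi": (-3.0, 3.0),               # deg
    "dphi": (-1.5, 1.5),              # deg/s
    "dpsi": (-1.5, 1.5),              # deg/s
    "mass": (40000.0, float("inf")),  # kg (remaining-mass constraint)
}

def landing_successful(terminal_state: dict) -> bool:
    """True if every constrained state at touchdown lies inside its Table I bound."""
    return all(lo <= terminal_state[key] <= hi
               for key, (lo, hi) in TERMINAL_BOUNDS.items())
```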
B. Goal-oriented Markov decision process

Reinforcement Learning (RL) is based on the concept of a Markov Decision Process (MDP), where the optimal action solely depends on the current state [19]. To elaborate further, at each time step denoted as t, the agent selects an action at ∈ A in response to the current state st ∈ S and the policy π : S → A. Subsequently, the environment transitions to the next state according to its dynamics, represented as st+1 = f(st, at), and provides a scalar reward signal denoted as rt. The value function v^π : S → R is defined as the expected cumulative sum of rewards generated by following policy π when initiated from an initial state s.

In the context of rocket landing control, we can abstractly model it as a goal-oriented MDP. The agent is required to reach the goal set Sgoal from the initial set Sinit while maximizing the discounted return defined as:

    max_π E_{τ∼π} [ Σ_{t=0}^{T} γ^t r(st) ],

where γ is the discount factor, and the goal-oriented reward function is expressed as:

    r(st) = 1 if st ∈ Sgoal,  and  r(st) = 0 if st ∉ Sgoal.

An episode terminates when state st enters Sterm = Sgoal ∪ Sfail, where Sfail represents a collection of states disjoint from Sgoal in which continuation of the simulation is infeasible.
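As an illustration of this sparse formulation, a single environment step could assign reward and termination as sketched below; `in_goal_set` and `in_fail_set` are placeholder predicates standing in for membership tests of Sgoal and Sfail.

```python
from typing import Callable, Tuple

def goal_oriented_outcome(
    state,
    in_goal_set: Callable[[object], bool],  # placeholder test for s in S_goal
    in_fail_set: Callable[[object], bool],  # placeholder test for s in S_fail
) -> Tuple[float, bool]:
    """Return (reward, terminated) for the goal-oriented reward defined above."""
    if in_goal_set(state):
        return 1.0, True    # reward 1 only inside the goal set
    if in_fail_set(state):
        return 0.0, True    # infeasible continuation: terminate with zero reward
    return 0.0, False       # intermediate steps carry zero reward
```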
C. Proximal policy optimization

Proximal policy optimization (PPO) [5] is an on-policy RL algorithm rooted in the policy gradient method, featuring straightforward implementation, parallel sampling, and stability during training. PPO finds extensive application in addressing complex challenges, such as high-dimensional continuous control tasks, the development of advanced AI for professional-level gaming, and the recent advancements in reinforcement learning from human feedback for refining large language models. The core of PPO lies in its objective function, expressed as:

    Jπ(θ) = (1/|D|) Σ_{x,u∈D} min( ρ(θ) A^πold(x, u), clip(ρ(θ), 1 − ϵ, 1 + ϵ) A^πold(x, u) ).   (1)

Here, θ denotes the parameter of the policy network, ρ(θ) = πθ(u|x) / πold(u|x) represents the importance sampling factor, and A^πold is the advantage function computed using generalized advantage estimation [20]. PPO also incorporates the learning of a state value network to estimate v^π, which serves as a baseline in generalized advantage estimation, contributing to variance reduction. While PPO excels in exploration capabilities, it encounters significant challenges when addressing goal-oriented environments independently, primarily due to the exponential sample complexity inherent in such scenarios. To overcome this obstacle effectively, it is necessary to combine PPO with our proposed RAJS method, thus facilitating the efficient attainment of satisfactory solutions.
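For reference, Eq. (1) can be written as a short NumPy routine over a sampled batch; this is a bare sketch of the clipped surrogate only, omitting the value network, advantage estimation, entropy regularization, and the KL-based early stopping used in practice.

```python
import numpy as np

def ppo_clip_objective(logp_new: np.ndarray,
                       logp_old: np.ndarray,
                       advantages: np.ndarray,
                       clip_eps: float = 0.2) -> float:
    """Sample estimate of the clipped surrogate objective in Eq. (1)."""
    ratio = np.exp(logp_new - logp_old)                       # importance sampling factor rho(theta)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```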
III. METHOD

A. Jump start framework

The jump start framework comprises a fixed guide policy πg(a|s) and a learnable exploration policy πe(a|s). In each episode, the guide policy first navigates the environment for the guide horizon H steps, after which the exploration policy πe takes over to complete the remaining steps. Selecting H close to the full task horizon reduces the exploration space of πe, enhancing the likelihood of reaching the goal.

The effectiveness of the jump start framework hinges on the design of the guide horizon H. In essence, jump start modifies the initial state distribution dinit of the original MDP to an easier d̃init produced by the guide policy. d̃init depends solely on the guide horizon H, which can be a constant or a random variable. The gap between d̃init during training and dinit during practical evaluation can negatively impact policy performance, necessitating careful design of H. The distribution mismatch coefficient (Definition 1) is a useful metric to quantify such a gap.

Definition 1 (Distribution mismatch coefficient [21]). The distribution mismatch coefficient

    D∞ = ‖ d^π_ρ / µ ‖∞

quantifies the effect of policy gradient learning under an initial state distribution µ possibly deviated from the initial state distribution of interest ρ. Here d^π_ρ refers to the discounted stationary state distribution under policy π:

    d^π_ρ = E_{s0∼ρ} [ (1 − γ) Σ_{t=0}^{∞} γ^t Pr^π(st = s | s0) ].

The JSRL paper introduces two variants: JSRL-Curriculum and JSRL-Random. JSRL-Curriculum dynamically adjusts H as a function of training progress using a pre-designed curriculum. In practical terms, a step function is employed:

    Hk = (1 − k/n) H̄,  k = 0, 1, . . . , n,

where H̄ is the initial guide horizon, n is the number of curriculum stages, and k is incremented based on the moving average of evaluated performance. The modified initial state distribution d̃init,k associated with Hk ensures that d̃init,n is equivalent to dinit, benefiting evaluation performance. However, the transition between stages introduces a notable flaw of distribution shift, adding unseen states to the state space and changing the initial state distribution at each stage switch.

The JSRL paper focuses on transitioning from offline RL to online RL, employing implicit Q-learning (IQL) [22] as the optimization algorithm in both stages. While IQL mitigates the distribution shift to some extent through its replay buffer, on-policy algorithms like PPO face significant challenges in policy updates due to this shift. Particularly in tasks such as rocket landing control, where the initial state distributions d̃init for different guide horizons H can be completely disjoint, this shift would disrupt policy learning, rendering the current policy unlearned.
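To make the role of D∞ concrete, the toy computation below evaluates the finite-state analogue of ‖d^π_ρ/µ‖∞ for two made-up training distributions; the numbers are illustrative and unrelated to the rocket task.

```python
import numpy as np

def distribution_mismatch(d_rho: np.ndarray, mu: np.ndarray) -> float:
    """Finite-state analogue of D_inf = || d^pi_rho / mu ||_inf."""
    return float(np.max(d_rho / mu))

d_rho = np.array([0.25, 0.25, 0.25, 0.25])      # visitation under the distribution of interest (toy)
mu_broad = np.array([0.20, 0.30, 0.30, 0.20])   # training distribution with broad support
mu_narrow = np.array([0.02, 0.02, 0.48, 0.48])  # training distribution concentrated on few states

print(distribution_mismatch(d_rho, mu_broad))   # 1.25 -> benign mismatch
print(distribution_mismatch(d_rho, mu_narrow))  # 12.5 -> slow policy-gradient convergence
```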
On the other hand, JSRL-Random samples the guide horizon H from a uniform distribution U(0, H̄), where H̄ is the horizon upper bound. Unlike JSRL-Curriculum, the initial state distribution induced by the random variable H remains fixed throughout training, essentially creating an alternative stationary MDP with an adjusted initial state distribution. This approach enhances stability during training compared to JSRL-Curriculum but introduces a different challenge of distribution mismatch. Although it is well established that the optimal policy π* for an MDP is independent of the specific dinit under mild conditions, different dinit do affect the iteration complexity, particularly when function approximation is involved. As proven by [21], the suboptimality V*(s) − V^π(s) of policy gradient after T iterations is positively correlated with D∞ ϵapprox. Here, ϵapprox is the approximation error, representing the minimal possible error for the given parametric function to fit the target distribution. This correlation implies that, for a fixed positive ϵapprox of a neural network function approximation, a larger distribution mismatch coefficient D∞ contributes to slower convergence. In the context of solving goal-oriented problems with jump start, ρ corresponds to dinit, and µ corresponds to d̃init of the chosen jump start schedule. Within the support of dinit, JSRL-Random's d̃init is much smaller than dinit, leading to a larger D∞. This explains the experimental observation that JSRL-Random's policy performance on the evaluation distribution is significantly inferior to optimality.

B. Random annealing jump start

Building upon the insights gained from the analysis presented earlier, and considering the effectiveness of the jump start framework in challenging goal-oriented environments when used with on-policy algorithms, we introduce Random Annealing Jump Start (RAJS). This approach addresses the limitations of distribution shift and distribution mismatch observed in prior works. RAJS achieves this by sampling the guide horizon from a uniform distribution, similar to JSRL-Random. However, a key distinction lies in initializing the upper bound of the uniform distribution with a large value and gradually annealing it to 0 during training, as expressed by the equation:

    H ∼ U(0, H̄ β(·)),   (2)

where β(·) denotes an annealing factor transitioning from 1 to 0. RAJS effectively mitigates distribution mismatch, as the exploration policy directly engages with the underlying goal-oriented environment after the annealing of H̄ to 0. Furthermore, RAJS significantly reduces distribution shift compared to JSRL-Curriculum. In the context of D∞, assuming µ and ρ represent d̃init before and after a distribution shift, RAJS exhibits a substantial overlap between µ and ρ, resulting in a much smaller D∞ in comparison to JSRL-Random.

Due to the minimal distribution shift during training, the tuning of β(·) can be simplified. We propose a schedule based on the moving average of training metrics. For goal-oriented tasks, the proportion of episodes terminating in the goal set (or success rate), denoted as Pgoal, serves as a suitable metric. The update rule for β(Pgoal) at the end of each training iteration is as follows:

    β ← max(β − α I(Pgoal ≥ Pthresh), 0),   (3)

where Pthresh is a tunable performance threshold, I(·) is the indicator function, and α is the update step size. Additionally, due to the improved training stability, designing a ramp schedule β(N) manually becomes trivial, with N representing the total number of environment interactions. The start and end steps of the ramp can be determined by solving the task once with β ≡ 1 and observing the success rate training curve.

RAJS relaxes JSRL's guide policy requirements, extending its applicability to on-policy RL algorithms, as outlined in Algorithm 1. In this paper, our focus is on PPO, discussed in the preliminary section, leveraging its general applicability to address the complex challenge of rocket landing control.

Algorithm 1 Random annealing jump start w/ on-policy RL
 1: Input: guide policy πg, maximum guide horizon H̄, metric threshold Pthresh, annealing step size α, training batch size B.
 2: Initialize exploration policy πe and other required function approximation, e.g., state value function V. Initialize annealing factor β ← 1 and moving mean metric P ← 0.
 3: procedure ROLLOUT(πg, πe, H̄, D, P)
 4:     Sample initial state s0 ∼ dinit, guide horizon H ∼ U(0, H̄).
 5:     Rollout ⌊H⌋ steps with guide policy πg.
 6:     Rollout until termination with exploration policy πe, logging trajectory {(s, a, r), . . .} to D.
 7:     Update P with metric of current episode.
 8: end procedure
 9: repeat
10:     Initialize trajectory dataset D = {}.
11:     Sample with ROLLOUT(πg, πe, H̄β, D, P), until |D| ≥ B.
12:     πe, V ← TRAINPOLICY(πe, V, D).
13:     β ← max(β − αI(P ≥ Pthresh), 0).
14: until β = 0 and convergence
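A minimal Python sketch of Algorithm 1's rollout and annealing steps is given below, assuming a Gymnasium-style `env` and callable `guide_policy` / `explore_policy`; the function names are ours, and the default Pthresh and α values (taken from Table II) are illustrative rather than part of an existing library API.

```python
import random

def rajs_rollout(env, guide_policy, explore_policy, h_max: float, beta: float):
    """One episode of random annealing jump start (Algorithm 1, lines 3-8)."""
    obs, _ = env.reset()
    guide_steps = int(random.uniform(0.0, h_max * beta))  # H ~ U(0, H_bar * beta), floored
    trajectory, done, t = [], False, 0
    while not done:
        policy = guide_policy if t < guide_steps else explore_policy
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        if t >= guide_steps:               # only exploration-policy steps are logged for training
            trajectory.append((obs, action, reward))
        obs, t = next_obs, t + 1
    return trajectory

def anneal(beta: float, success_rate: float,
           p_thresh: float = 0.3, alpha: float = 1.0 / 1500) -> float:
    """Annealing update of Eq. (3): shrink beta once the success-rate metric clears the threshold."""
    return max(beta - alpha * float(success_rate >= p_thresh), 0.0)
```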
C. Practical techniques

Several practical techniques are introduced to further enhance the resolution of goal-oriented tasks. While these techniques were initially developed in the context of rocket landing control, their underlying principles are broadly applicable to a diverse range of tasks.

1) Cascading jump start: In rocket landing control, the baseline controller, which serves as the guide policy, fails in 92% of cases. Consequently, the exploration policy may face a dilemma of hard exploration or unrecoverable failure, depending on whether the guide horizon is shorter or longer. Although RAJS significantly reduces the exploration space for RL agents, the initial training phase remains challenging to explore. With respect to the PPO agent, it may suffer from premature entropy collapse, necessitating careful selection of the entropy regulation coefficient to maintain agent exploration. This complicates further performance tuning, as additional difficulty imposed on the agent (e.g., an action smoothness requirement) can impede policy convergence.
In this scenario, an effective technique involves incorporating cascading jump start, where an agent is trained under the vanilla setting and subsequently used as the new guide policy πg′ in experiments with more demanding settings. Experiments demonstrate that this technique facilitates simplified exploration at the outset of training, with no significant impact on the final performance after convergence.
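The cascading idea reduces to a two-stage training loop, sketched below with hypothetical helpers (`make_env`, `train_rajs`) and a `baseline_controller` guide; stage two simply promotes the stage-one agent to the new guide policy πg′ while the harder setting (e.g., action smoothing) is switched on.

```python
# Hypothetical two-stage cascading jump start; all helper names are illustrative.
def cascading_jump_start(make_env, train_rajs, baseline_controller):
    # Stage 1: vanilla setting, baseline controller as the guide policy.
    stage1_policy = train_rajs(make_env(action_smoothing=False),
                               guide_policy=baseline_controller)
    # Stage 2: more demanding setting, stage-1 agent as the new guide policy pi_g'.
    stage2_policy = train_rajs(make_env(action_smoothing=True),
                               guide_policy=stage1_policy)
    return stage2_policy
```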
2) Reward design: The technical relevance of the outcome of a goal-oriented problem lies solely in the goal set, specifically the satisfaction of terminal state constraints associated with the set. As formulated in the preliminary section, the precisely equivalent terminal reward signal would be a binary signal indicating constraint satisfaction. However, the absence of a smooth reward gradient causes the policy's exploration to be purely driven by "luck", leading to a deterioration in training efficiency and performance.

In an effort to relax the reward, several rules must be followed to ensure that the new objective stays close to the original definition:
1) All intermediate steps should have zero reward. Non-zero intermediate rewards are prone to unexpected policy exploitation behavior and tend to significantly deviate the task from the original definition.
2) Terminal reward should be non-negative. Easily obtained negative rewards would drive the policy to extend intermediate steps and avoid reaching terminal states due to γ-discounting.
Adhering to these requirements, we can provide smooth rewards in Sprox, where Sgoal ⊂ Sprox ⊆ Sterm, to guide trajectories that terminate in the proximity of the goal set in the correct optimization direction. In rocket landing control, Sprox can be chosen as all landing states {s | y = 0}, while other conditions of failure, including fuel exhaustion and vertical speed reversal, still receive zero terminal reward. In Sprox, we propose a logarithmic function:

    rprox = max(b − log(1 + p max e), 0),   (4)

where p and b are the scale and bias terms controlling the effective range of terminal states that receive a positive reward. Each element of e corresponds to the normalized absolute terminal error of a constrained state:

    e = |sT − starget| / srange.

Compared to the typical quadratic reward form, this ensures a non-zero gradient even if the terminal state is far from the target, as well as a larger gradient close to the target to facilitate constraint satisfaction among noisy samples.
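Eq. (4) translates directly into a few lines of Python; the scale p and bias b defaults below are arbitrary placeholders, and the target and range vectors would be taken from the terminal constraints in Table I.

```python
import numpy as np

def proximity_reward(s_terminal: np.ndarray,
                     s_target: np.ndarray,
                     s_range: np.ndarray,
                     p: float = 10.0,
                     b: float = 3.0) -> float:
    """Logarithmic terminal reward of Eq. (4), applied only on the relaxed set S_prox."""
    e = np.abs(s_terminal - s_target) / s_range          # normalized absolute terminal errors
    return float(max(b - np.log(1.0 + p * np.max(e)), 0.0))
```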
3) Terminal condition: In addition to the primary terminal condition, the application of early termination based on heuristics proves beneficial for credit assignment in the absence of intermediate rewards, leading to an acceleration in policy learning. In the rocket landing task, learning vertical speed control from scratch is time-consuming due to a long control horizon, difficulties in credit assignment, the coupling of y and vy, and the tight bound on vy when y reaches 0. To address this, kinematics rules are employed to provide coarse information about feasibility. An additional early termination condition is derived for situations where the task is inevitably bound to fail:

    ymin = (vy² − vsw²) / (2 amax,1) + vsw² / (2 amax,2),   if vy ≤ vsw,
    ymin = vy² / (2 amax,2),                                if vsw < vy < 0,   (5)

where amax,1 and amax,2 represent the approximate values of maximum deceleration in the two control stages. The episode is terminated with zero terminal reward once y ≤ ymin, signifying the impossibility of a proper landing even with maximum deceleration.
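The early-termination rule of Eq. (5) amounts to a short feasibility check; amax,1 and amax,2 are stage-wise deceleration limits whose default values below are rough placeholders, not calibrated plant data.

```python
def min_feasible_altitude(vy: float, v_sw: float = -60.0,
                          a_max1: float = 20.0, a_max2: float = 15.0) -> float:
    """y_min of Eq. (5): the lowest altitude from which the rocket can still brake to vy = 0."""
    if vy <= v_sw:   # brake to v_sw in stage 1, then to zero in stage 2
        return (vy**2 - v_sw**2) / (2.0 * a_max1) + v_sw**2 / (2.0 * a_max2)
    return vy**2 / (2.0 * a_max2)   # already below the switching speed (v_sw < vy < 0)

def should_terminate_early(y: float, vy: float) -> bool:
    """Terminate with zero terminal reward once the remaining altitude cannot absorb the speed."""
    return vy < 0.0 and y <= min_feasible_altitude(vy)
```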
4) Action smoothing: In dealing with high-fidelity plants, it is common for them to account for actuator delay to a certain extent. However, modeling the response perfectly, especially the transient response under oscillating control signals, is impractical. Therefore, for practical actuators, achieving a smooth operating curve is desirable to minimize the disparity between simulation and reality. In this context, RL policies, supported by powerful neural networks, are typically less effective than classical control methods, which is attributed to the high nonlinearity and the absence of intrinsic motivation for smoothness. The challenge becomes more pronounced when addressing goal-oriented tasks, where the requirement of zero intermediate reward hinders direct smoothness regulation.

To address these challenges, two measures are implemented to improve action smoothness in goal-oriented tasks. Firstly, we redefine the action ã as the actuator increment, incorporating the original action a into the state observation s̃:

    s̃ = [s, a],
    ã = ∆a,                                  (6)
    a′ = clip(a + k∆a, −1, 1),

where k is a scaling factor. Secondly, instead of relying on a regulatory reward, we intervene in the learning process at the loss level. This is achieved by adding a term to PPO's policy loss (1) alongside the advantage:

    J̃π(θ) = Jπ(θ) + ϵ Σ ∥ã∥²,   (7)

where ϵ is a small positive coefficient. While these two measures significantly reduce oscillation, they also introduce increased learning difficulty. Through cascading jump start, these measures are only applied in the second training stage with a strong guide policy, thereby alleviating the associated difficulties.
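A compact sketch of the two measures follows: the incremental-action transformation of Eq. (6), with the previous action appended to the observation, and the loss-level penalty of Eq. (7); the scaling factor k and coefficient ϵ values are illustrative.

```python
import numpy as np

def incremental_action(prev_action: np.ndarray, delta: np.ndarray, k: float = 0.1) -> np.ndarray:
    """Eq. (6): the policy outputs an increment; the actuator command is the clipped accumulation."""
    return np.clip(prev_action + k * delta, -1.0, 1.0)

def augment_observation(obs: np.ndarray, prev_action: np.ndarray) -> np.ndarray:
    """Eq. (6): expose the current actuator command to the policy, s_tilde = [s, a]."""
    return np.concatenate([obs, prev_action])

def smoothed_policy_objective(j_ppo: float, increments: np.ndarray, eps: float = 1e-3) -> float:
    """Eq. (7): PPO policy loss augmented with a squared-increment smoothness term."""
    return j_ppo + eps * float(np.sum(increments ** 2))
```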
IV. EXPERIMENT

A. Environment configuration

As detailed in the problem statement, the high-fidelity rocket plant and its baseline controller were modeled using Simulink. To enable interaction with the RL policy, we wrapped the system to comply with the standard RL environment interface, as is shown in Fig. 2. RL training demands a significant number of samples for convergence. However, Simulink's
interpreted execution is not optimal for efficient and parallel environment interaction. To address this, we utilized Simulink Embedded Coder to generate C code, compiling it into an efficient native module. The use of GOPS Slxpy [23] facilitated automated glue code generation, producing a cross-platform binary with deterministic simulation and flexible parameter tunability. Notably, the control signal supplied to the plant can dynamically transition between external action and the baseline controller through a parameter, streamlining integration with RAJS.

[Fig. 2. Wrapped plant for RL training — blocks: Plant, Postprocess, PPO-RAJS, Guidance and control system; signals: external action, baseline action, state, obs, reward, done, info]
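The wrapped plant can be pictured as a thin Gym-style facade over the compiled module; the `plant` interface and the `use_external_action` switch below are hypothetical stand-ins for the generated binary and its tunable parameter.

```python
class WrappedRocketEnv:
    """Illustrative Gym-style wrapper over the compiled plant (interface names are assumed)."""

    def __init__(self, plant, use_external_action: bool = True):
        self.plant = plant
        # When False, the plant consumes its internal guidance-and-control (baseline) action instead.
        self.use_external_action = use_external_action

    def reset(self):
        state = self.plant.reset()
        return self._observe(state), {}

    def step(self, action):
        state = self.plant.step(action if self.use_external_action else None)
        obs = self._observe(state)
        reward, terminated = self._postprocess(state)
        return obs, reward, terminated, False, {}

    def _observe(self, state):
        # Placeholder: select the 12 kinematic variables plus the axial load from the plant state.
        return state

    def _postprocess(self, state):
        # Placeholder: terminal reward (Eq. (4)) and termination logic (including Eq. (5)).
        return 0.0, False
```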
TABLE II
HYPERPARAMETERS

Algorithm   Parameter                       Value
Shared      Learning rate                   3 × 10⁻⁴
            Network size                    (256, 256)
            Network activation              tanh
            Discount factor γ               0.995
            GAE λ                           0.97
            Train batch size                20 000
            Gradient steps                  30
            Clip parameter ϵ                0.2
            Target KL divergence            0.01
            Entropy coefficient             0.007
PPO-RAJS    Maximum guide horizon H̄         18
            Success rate threshold Pthresh  0.3
            Annealing step size α           1/1500

[Figure: success rate training curves — PPO, PPO-Track, PPO-JSRL, PPO-RAJS, baseline controller; x-axis: Environment Step (×1e8), y-axis: Success rate]
In terms of reward formulation, all experiments, except […]
C. Evaluation result

We generate 10⁶ distinct initial conditions to comprehensively evaluate the final performance of the PPO-RAJS-S policy across a vast initial state space. The statistical analysis presented in Table III aligns closely with the training curves. Notably, the component vy exhibits the highest frequency of violations among all constraints, corroborating the earlier discussion on the challenge posed by controlling vy effectively. Figure 5 illustrates that the majority of trajectories either achieve success or fail proximal to the goal set boundary. However, a small subset experiences pose instability due to aggressive pose control, leading to landing significantly distant from the target, highlighting a current limitation of our approach.

The enhanced smoothness of actions is depicted in Figure 6. While both PPO-RAJS and PPO-RAJS-S successfully reach the goal set, the former displays notable fluctuations, which could potentially challenge practical actuator implementations. Conversely, actions produced by PPO-RAJS-S exhibit considerably smoother behavior.

Furthermore, we deploy the model on the ZU9E embedded platform and perform hardware-in-the-loop co-simulation with a rocket dynamics simulation engine. Results from the simulation demonstrate that policy inference fits within the control interval of 10 ms, validating its real-time applicability.

V. CONCLUSIONS

This paper presents the random annealing jump start approach, which utilizes baseline controllers to empower RL algorithms in tackling complex real-world goal-oriented tasks. Given the safety-critical nature of rocket landing control, our future research will delve into integrating safe RL theory, such as neural barrier certificates [24], to manage state constraints more effectively. This integration holds promise in addressing …
[Fig. 6 panels: Thrust, Attitude 1, Attitude 3 — action curves of PPO-RAJS vs. PPO-RAJS-S]

[5] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[6] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018, pp. 1861–1870.
[8] …, "…supervised auxiliary tasks," in International Conference on Learning Representations, 2017.
[9] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, "Curiosity-driven exploration by self-supervised prediction," in International Conference on Machine Learning. PMLR, 2017, pp. 2778–2787.
[10] A. Zanette, A. Lazaric, M. Kochenderfer, and E. Brunskill, "Learning […]