
IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 4, NO. 4, OCTOBER 2019

Low-Level Control of a Quadrotor With Deep Model-Based Reinforcement Learning

Nathan O. Lambert, Daniel S. Drew, Joseph Yaconelli, Sergey Levine, Roberto Calandra, and Kristofer S. J. Pister

Abstract—Designing effective low-level robot controllers often entails platform-specific implementations that require manual heuristic parameter tuning, significant system knowledge, or long design times. With the rising number of robotic and mechatronic systems deployed across areas ranging from industrial automation to intelligent toys, the need for a general approach to generating low-level controllers is increasing. To address the challenge of rapidly generating low-level controllers, we argue for using model-based reinforcement learning (MBRL) trained on relatively small amounts of automatically generated (i.e., without system simulation) data. In this letter, we explore the capabilities of MBRL on a Crazyflie centimeter-scale quadrotor with rapid dynamics to predict and control at ≤50 Hz. To our knowledge, this is the first use of MBRL for controlled hover of a quadrotor using only on-board sensors, direct motor input signals, and no initial dynamics knowledge. Our controller leverages rapid simulation of a neural network forward dynamics model on a graphics processing unit (GPU) enabled base station, which then transmits the best current action to the quadrotor firmware via radio. In our experiments, the quadrotor achieved hovering capability of up to 6 s with 3 min of experimental training data.

Fig. 1. The model predictive control loop used to stabilize the Crazyflie. Using deep model-based reinforcement learning, the quadrotor reaches stable hovering with only 10,000 trained datapoints – equivalent to 3 minutes of flight.

Index Terms—Deep learning in robotics and automation, aerial systems: mechanics and control.
I. INTRODUCTION

THE ideal method for generating a robot controller would be extremely data efficient, free of requirements on domain knowledge, and safe to run. Current strategies to derive low-level controllers are effective across many platforms, but system identification often requires substantial setup and experiment time, while PID tuning requires some domain knowledge and still results in dangerous roll-outs. With the goal to reduce reliance on expert-based controller design, in this letter we investigate the question: Is it possible to autonomously learn competitive low-level controllers for a robot, without simulation or demonstration, in a limited amount of time?

To answer this question, we turn to model-based reinforcement learning (MBRL) – a compelling approach to synthesize controllers even for systems without analytic dynamics models and with high cost per experiment [1]. MBRL has been shown to operate in a data-efficient manner to control robotic systems by iteratively learning a dynamics model and subsequently leveraging it to design controllers [2]. Our contribution builds on simulated results of MBRL [3]. We employ the quadrotor as a testing platform to broadly investigate controller generation on a highly nonlinear, challenging system, not to directly compare performance versus existing controllers. This letter is the first demonstration of controlling a quadrotor with direct motor assignments sent from an MBRL-derived controller learning only via experience. Our work differs from recent progress in MBRL with quadrotors by exclusively using experimental data and focusing on low-level control, while related applications of learning with quadrotors employ low-level control generated in simulation [4] or use a dynamics model learned via experience to command on-board controllers [5]. Our MBRL solution, outlined in Figure 1, employs neural networks (NN) to learn a forward dynamics model coupled with a 'random shooter' MPC, which can be efficiently parallelized on a graphics processing unit (GPU) to execute low-level, real-time control.

Using MBRL, we demonstrate controlled hover of a Crazyflie via on-board sensor measurements and application of pulse width modulation (PWM) motor voltage signals. Our method for quickly learning controllers from real-world data is not yet an alternative to traditional controllers such as PID, but it opens important avenues of research. The general mapping of the forward dynamics model, in theory, allows the model to be used for control tasks beyond attitude control. Additionally, we highlight the capability of leveraging the predictive models learned on extremely little data for working at frequencies ≤50 Hz, while a hand-tuned PID controller at this frequency failed to hover the Crazyflie. With the benefits outlined, the current MBRL approach has limitations in performance and applicability to our goal of use with other robots. The performance in this letter has notable room for improvement by mitigating drift. Future applications are limited by our approach's requirement of a high-power external GPU – a prohibitively large computational footprint when compared to standard low-level controllers – and by the method's potential for collisions when learning.

The resulting system achieves repeated stable hover of up to 6 seconds, with failures due to drift of unobserved states, within 3 minutes of fully-autonomous training data. These results demonstrate the ability of MBRL to control robotic systems in the absence of a priori knowledge of dynamics, pre-configured internal controllers for stability or actuator response smoothing, and expert demonstration.

Manuscript received February 24, 2019; accepted July 7, 2019. Date of publication July 23, 2019; date of current version August 15, 2019. The work of J. Yaconelli was supported by the Berkeley Sensors & Actuator Center SUPERB REU Program. This letter was recommended for publication by Associate Editor R. Triebel and Editor T. Asfour upon evaluation of the reviewers' comments. (Corresponding author: Nathan O. Lambert.)

N. O. Lambert, D. S. Drew, S. Levine, and K. S. J. Pister are with the Department of Electrical Engineering and Computer Sciences, University of California–Berkeley, Berkeley, CA 94720 USA (e-mail: nol@berkeley.edu; [email protected]; [email protected]; pister@eecs.berkeley.edu).

J. Yaconelli is with the University of Oregon, Eugene, OR 97403-1202 USA (e-mail: [email protected]).

R. Calandra is with Facebook AI Research, Menlo Park, CA 94025 USA (e-mail: [email protected]).

This letter has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the authors. The video shows early and final results, with a brief discussion of future work.

Digital Object Identifier 10.1109/LRA.2019.2930489

II. RELATED WORK

A. Attitude and Hover Control of Quadrotors

Classical controllers (e.g., PID, LQR, iLQR) in conjunction with analytic models for the rigid body dynamics of a quadrotor are often sufficient to control vehicle attitude [6]. In addition, linearized models are sufficient to simultaneously control for global trajectory attitude setpoints using well-tuned nested PID controllers [7]. Standard control approaches show impressive acrobatic performance with quadrotors, but we note that we are not interested in comparing our approach to finely-tuned performance; the goal of using MBRL in this context is to highlight a solution that automatically generates a functional controller in less or equal time than initial PID hand-tuning, with no foundation of dynamics knowledge.

Research focusing on developing novel low-level attitude controllers shows functionality in extreme nonlinear cases, such as for quadrotors with a missing propeller [8], with multiple damaged propellers [9], or with the capability to dynamically tilt its propellers [10]. Optimal control schemes have demonstrated results on standard quadrotors with extreme precision and robustness [11].

Our work differs by specifically demonstrating the possibility of attitude control via real-time external MPC. Unlike other work on real-time MPC for quadrotors, which focuses on trajectory control [12], [13], ours uses a dynamics model derived fully from in-flight data that takes motor signals as direct inputs. Effectively, our model encompasses only the actual dynamics of the system, while other implementations learn dynamics conditioned on previously existing internal controllers. The general nature of our model from sensors to actuators demonstrates the potential for use on robots with no previous controller — we only use the quadrotor as the basis for comparison and do not expect it to be the limits of the MBRL system's functionality.

B. Learning for Quadrotors

Although learning-based approaches have been widely applied for trajectory control of quadrotors, implementations typically rely on sending controller outputs as setpoints to stable on-board attitude and thrust controllers. Iterative learning control (ILC) approaches [14], [15] have demonstrated robust control of quadrotor flight trajectories but require these on-board controllers for attitude setpoints. Learning-based model predictive control implementations, which successfully track trajectories, also wrap their control around on-board attitude controllers by directly sending Euler angle or thrust commands [16], [17]. Gaussian process-based automatic tuning of position controller gains has been demonstrated [18], but only in parallel with on-board controllers tuned separately.

Model-free reinforcement learning has been shown to generate control policies for quadrotors that out-perform linear MPC [4]. Although similarly motivated by a desire to generate a control policy acting directly on actuator inputs, the work used an external vision system for state error correction, operated with an internal motor speed controller enabled (i.e., thrusts were commanded and not motor voltages), and generated a large fraction of its data in simulation.

Researchers of system identification for quadrotors also apply machine learning techniques. Bansal et al. used NN models of the Crazyflie's dynamics to plan trajectories [5]. Our implementation differs by directly predicting change in attitude with on-board IMU measurements and motor voltages, rather than predicting with global, motion-capture state measurements and thrust targets for the internal PIDs. Using Bayesian Optimization to learn a linearized quadrotor dynamics model demonstrated capabilities for tuning of an optimal control scheme [19]. While this approach is data-efficient and is shown to outperform analytic models, the model learned is task-dependent. Our MBRL approach is task-agnostic, requiring only a change in objective function and no new dynamics data for a new task.

C. Model-Based Reinforcement Learning

Functionality of MBRL is evident in simulation for multiple tasks in low data regimes, including quadrupeds [20] and manipulation tasks [21]. Low-level MBRL control (i.e., with direct motor input signals) of an RC car has been demonstrated experimentally, but the system is of lower dimensionality and has static stability [22]. Relatively low-level control (i.e., mostly thrust commands only passed through an internal governor before conversion to motor signals) of an autonomous helicopter has been demonstrated, but required a ground-based vision system for error correction in state estimates as well as expert demonstration for model training [22].

Properly optimized NNs trained on experimental data show test error below common analytic dynamics models for flying vehicles, but the models did not include direct actuator signals and did not include experimental validation through controller implementation [23]. A model predictive path integral (MPPI) controller using a learned NN demonstrated data-efficient trajectory control of a quadrotor, but results were only shown in simulation and required the network to be initialized with 30 minutes of demonstration data with on-board controllers [2].

MBRL with trajectory sampling for control outperforms, in terms of samples needed for convergence, the asymptotic performance of recent model-free algorithms in low-dimensional tasks [3]. Our work builds on the strategies presented there, with most influence derived from "probabilistic" NNs, to demonstrate functionality in an experimental setting — i.e., in the presence of real-world higher-order effects, variability, and time constraints.

NN-based dynamics models with MPC have functioned for experimental control of an under-actuated hexapod [24]. The hexapod platform does not have the same requirements on frequency or control error due to its static stability, and incorporates a GPS unit for relatively low-noise state measurements. Our work has a similar architecture, but has improvements in the network model and model predictive controller to allow substantially higher control frequencies with noisy state data. By demonstrating functionality without global positioning data, the procedure can be extended to more robot platforms where only internal state and actuator commands are available to create a dynamics model and control policy.

III. EXPERIMENTAL SETUP

In this letter, we use as experimental hardware platform the open-source Crazyflie 2.0 quadrotor [25]. The Crazyflie is 27 g and 9 cm², so the rapid system dynamics create a need for a robust controller; by default, the internal PID controller used for attitude control runs at 500 Hz, with Euler angle state estimation updates at 1 kHz. This section specifies the ROS base-station and the firmware modifications required for external stability control of the Crazyflie.

All components we used are based on publicly available and open-source projects. We used the Crazyflie ROS interface supported here: https://github.com/whoenig/crazyflie_ros [26]. This interface allows for easy modification of the radio communication and employment of the learning framework. Our ROS structure is simple, with a Crazyflie subscribing to PWM values generated by a controller node, which processes radio packets sent from the quadrotor in order to pass state variables to the model predictive controller (as shown in Figure 2). The Crazyradio PA USB radio is used to send commands from the ROS server; software settings in the included client increase the maximum data transmission bitrate up to 2 Mbps, and a Crazyflie firmware modification improves the maximum traffic rate from 100 Hz to 400 Hz.

Fig. 2. The ROS computer passes control signals and state data between the MPC node and the Crazyflie ROS server. The Crazyflie ROS server packages Tx PWM values to send and unpacks Rx compressed log data from the robot.

In packaged radio transmissions from the ROS server we define actions directly as the pulse-width modulation (PWM) signals sent to the motors. To assign these PWM values directly to the motors we bypass the controller updates in the standard Crazyflie firmware by changing the motor power distribution whenever a CRTP Commander packet is received (see Figure 2). The Crazyflie ROS package sends empty ping packets to the Crazyflie to ask for logging data in the returning acknowledgment packet; without decreasing the logging payload and rate we could not simultaneously transmit PWM commands at the desired frequency due to radio communication constraints. We created a new internal logging block of compressed IMU data and Euler angle measurements to decrease the required bitrate for logging state information, trading state measurement precision for update frequency. Action commands and logged state data are communicated asynchronously; the ROS server control loop has a frequency set by the ROS rate command, while state data is logged based on a separate ROS topic frequency. To verify control frequency and reconstruct state-action pairs during autonomous rollouts we use a round-trip packet ID system.
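For concreteness, a minimal sketch of the base-station side of this loop is shown below. The topic name, message type, and hover PWM values are illustrative assumptions, not the actual crazyflie_ros interface, and the MPC query is stubbed out; the real system additionally handles the asynchronous logging and packet ID bookkeeping described above.

```python
# Hypothetical sketch of the base-station control loop (topic name, message
# type, and PWM values are illustrative, not the crazyflie_ros API).
import rospy
from std_msgs.msg import UInt16MultiArray

def compute_best_action():
    # Placeholder for the random-shooter MPC query described in Section V.
    return [30000, 30000, 30000, 30000]  # hypothetical [m1, m2, m3, m4] PWMs

def control_loop():
    rospy.init_node("mpc_controller")
    pub = rospy.Publisher("/crazyflie/pwm_cmd", UInt16MultiArray, queue_size=1)
    rate = rospy.Rate(50)  # locked control frequency; state logs arrive
                           # asynchronously on a separate topic
    while not rospy.is_shutdown():
        pub.publish(UInt16MultiArray(data=compute_best_action()))
        rate.sleep()

if __name__ == "__main__":
    control_loop()
```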
IV. LEARNING FORWARD DYNAMICS

The foundation of a controller in MBRL is a reliable forward dynamics model for predictions. In this letter, we refer to the current state and action as $s_t$ and $a_t$, which evolve according to the dynamics $f(s_t, a_t)$. Generating a dynamics model for the robot often consists of training an NN to fit a parametric function $f_\theta$ to predict the next state of the robot as a discrete change in state $s_{t+1} = s_t + f_\theta(s_t, a_t)$. In training, using a probabilistic loss function with a penalty term on the variance of estimates, as shown in Equation (1), better clusters predictions for more stable predictions across multiple time-steps [3]. The probabilistic loss fits a Gaussian distribution to each output of the network, represented in total by a mean vector $\mu_\theta$ and a covariance matrix $\Sigma_\theta$:

$$l = \sum_{n=1}^{N} \left[\mu_\theta(s_n, a_n) - s_{n+1}\right]^T \Sigma_\theta^{-1}(s_n, a_n) \left[\mu_\theta(s_n, a_n) - s_{n+1}\right] + \log \det \Sigma_\theta(s_n, a_n). \quad (1)$$

The probabilistic loss function assists model convergence and the variance penalty helps maintain stable predictions on longer time horizons. Our networks, implemented in PyTorch, train with the Adam optimizer [27] for 60 epochs with a learning rate of 0.0005 and a batch size of 32. Figure 3 summarizes the network design.
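As a concrete illustration, Equation (1) with a diagonal covariance (one predicted variance per state dimension, a common simplification we assume here rather than the paper's exact parameterization) could be written in PyTorch as:

```python
import torch

def probabilistic_loss(mean, logvar, next_state_delta):
    """Gaussian negative log-likelihood of Eq. (1), assuming diagonal Σθ.

    mean, logvar:     (batch, 9) network outputs for the change in state
    next_state_delta: (batch, 9) measured labels
    """
    err = mean - next_state_delta
    inv_var = torch.exp(-logvar)             # Σθ⁻¹ for a diagonal covariance
    mahalanobis = (err ** 2 * inv_var).sum(dim=1)
    log_det = logvar.sum(dim=1)              # log det Σθ = Σ log σ² per sample
    return (mahalanobis + log_det).mean()

# Training configuration reported above (optimizer shown for reference):
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # 60 epochs, batch 32
```

Predicting a log-variance rather than a variance keeps the covariance positive without constrained optimization, which is one standard way to realize this loss.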

Fig. 3. The NN dynamics model predicts the mean and variance of the change in state given the past 4 state-action pairs. We use 2 hidden layers of width 250 neurons.

All layers except for the output layer use the Swish activation function [28] with parameter β = 1. The network structure was cross-validated offline for prediction accuracy versus potential control frequency. Initial validation of training parameters was done on early experiments, and the final values are held constant for each rollout in the experiments reported in Section VI. The validation set is a random subset of measured $(s_t, a_t, s_{t+1})$ tuples in the pruned data.

Additional dynamics model accuracy could be gained with systematic model verification between rollouts, but experimental variation in the current setup would limit empirical insight, and a lower model loss does not guarantee improved flight time. Our initial experiments indicate improved flight performance with forward dynamics models minimizing the mean and variance of state predictions versus models minimizing mean squared prediction error, but more experiments are needed to state clear relationships between model parameters and flight performance.
Training a probabilistic NN to approximate the dynamics model requires pruning of logged data (e.g., dropped packets) and scaling of variables to assist model convergence. Our state $s_t$ is the vector of Euler angles (yaw, pitch, and roll), linear accelerations, and angular accelerations, reading

$$s_t = \left[\dot{\omega}_x, \dot{\omega}_y, \dot{\omega}_z, \phi, \theta, \psi, \ddot{x}, \ddot{y}, \ddot{z}\right]^T. \quad (2)$$

The Euler angles are from an internal complementary filter, while the linear and angular accelerations are measured directly from the on-board MPU-9250 9-axis IMU. In practice, for predicting across longer time horizons, modeling acceleration values as a global next state rather than a change in state increased the length of time horizon in composed predictions before the models diverged. While the change in Euler angle predictions is stable, the change in raw accelerations varies widely with sensor noise and causes non-physical dynamics predictions, so all the linear and angular accelerations are trained to fit the global next state.

We combine the state data with the four PWM values, $a_t = [m_1, m_2, m_3, m_4]^T$, to get the system information at time t. The NNs are cross-validated to confirm that using all state data (i.e., including the relatively noisy raw measurements) improves prediction accuracy in the change in state.

While the dynamics for a quadrotor are often represented as a linear system, for a Micro Air Vehicle (MAV) at high control frequencies motor step response and thrust asymmetry heavily impact the change in state, resulting in a heavily nonlinear dynamics model. The step response of a Crazyflie motor RPM from PWM 0 to max or from max to 0 is on the order of 250 ms, so our update time-step of 20 ms is short enough for motor spin-up to contribute to learned dynamics. To account for spin-up, we append past system information to the current state and PWMs to generate an input into the NN model that includes past time. From the exponential step response and with a bounded possible PWM value within $p_{eq} \pm 5000$, the motors need approximately 25 ms to reach the desired rotor speed; when operating at 50 Hz, the time step between updates is 20 ms, leading us to an appended state and PWM history of length 4. This state-action history length was validated as having the lowest test error on our data-set (lengths 1 to 10 evaluated). This yields the final input of length 52 to our NN, ξ, with states and actions combined to

$$\xi_t = \left[s_t \; s_{t-1} \; s_{t-2} \; s_{t-3} \; a_t \; a_{t-1} \; a_{t-2} \; a_{t-3}\right]^T.$$
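A PyTorch sketch of the network in Figure 3 under these dimensions (52 inputs, two hidden layers of 250 Swish units, and a 9-dimensional mean and log-variance output) might read as follows; the log-variance parameterization and layer names are our assumptions:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HISTORY = 9, 4, 4   # (9 + 4) * 4 = 52 inputs

class ForwardDynamics(nn.Module):
    """Sketch of the Fig. 3 model: mean and variance of the change in state."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear((STATE_DIM + ACTION_DIM) * HISTORY, 250),
            nn.SiLU(),   # Swish with β = 1 is the SiLU activation
            nn.Linear(250, 250),
            nn.SiLU(),
        )
        self.mean = nn.Linear(250, STATE_DIM)    # μθ, no output activation
        self.logvar = nn.Linear(250, STATE_DIM)  # diagonal Σθ as log-variance

    def forward(self, xi):
        # xi stacks the past 4 states and past 4 PWM actions: ξt in the text
        h = self.body(xi)
        return self.mean(h), self.logvar(h)
```

This pairs directly with the probabilistic loss sketched in the previous section: the two output heads feed the Mahalanobis and log-determinant terms of Eq. (1).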
Fig. 4. Predicted states for N = 50 candidate actions with the chosen "best action" highlighted in red. The predicted state evolution is expected to diverge from the ground truth for future t because actions are re-planned at every step.

V. LOW LEVEL MODEL-BASED CONTROL

This section explains how we incorporate our learned forward dynamics model into a functional controller. The dynamics model is used for control by predicting the state evolution given a certain action, and the MPC provides a framework for evaluating many action candidates simultaneously. We employ a 'random shooter' MPC, where a set of N randomly generated actions are simulated over a time horizon T. The best action is decided by a user-designed objective function that takes in the simulated trajectories $\hat{X}(a, s_t)$ and returns a best action, $a^*$, as visualized in Figure 4. The objective function minimizes the receding horizon cost of each state from the end of the prediction window to the current measurement.

The candidate actions, $\{a_i = (a_{i,1}, a_{i,2}, a_{i,3}, a_{i,4})\}_{i=1}^{N}$, are 4-tuples of motor PWM values centered around the stable hover-point for the Crazyflie. The candidate actions are constant across the prediction time horizon T. For a single sample $a_i$, each $a_{i,j}$ is chosen from a uniform random variable on the interval $[p_{eq,j} - \sigma, p_{eq,j} + \sigma]$, where $p_{eq,j}$ is the equilibrium PWM value for motor j. The range of the uniform distribution is controlled by the tuned parameter σ; this has the effect of restricting the variety of actions the Crazyflie can take. For the given range of PWM values for each motor, $[p_{eq} - \sigma, p_{eq} + \sigma]$, we discretize the candidate PWM values to a step size of 256 to match the future compression into a radio packet. This discretization of available action choices increases the coverage of the candidate action space. The compression of PWM resolution, while helpful for sampling and communication, represents an uncharacterized detriment to performance.
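A NumPy sketch of this sampling scheme follows; the equilibrium PWM and σ values below are placeholders (in practice $p_{eq}$ is estimated from flight data and σ is a tuned parameter):

```python
import numpy as np

def sample_candidate_actions(n=5000, p_eq=30000, sigma=5000, step=256):
    """Draw n candidate actions, each a 4-tuple of motor PWMs held constant
    over the horizon: uniform on [p_eq - sigma, p_eq + sigma] per motor,
    snapped to a 256-count grid to match the radio-packet compression."""
    raw = np.random.uniform(p_eq - sigma, p_eq + sigma, size=(n, 4))
    return (np.round(raw / step) * step).astype(int)

candidates = sample_candidate_actions()  # shape (5000, 4)
```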


Our investigation focuses on controlled hovering, but other tasks could be commanded with a simple change to the objective function. The objective we designed for stability seeks to minimize pitch and roll, while adding additional cost terms to Euler angle rates. In the cost function, λ affects the ratio between proportional and derivative gains. Adding cost terms to predicted accelerations did not improve performance because of the variance of the predictions.

$$a^* = \arg\min_{a} \sum_{t=1}^{T} \lambda(\psi_t^2 + \theta_t^2) + \dot{\psi}_t^2 + \dot{\theta}_t^2 + \dot{\phi}_t^2. \quad (3)$$

Our MPC operates on a time horizon T = 12 to leverage the predictive power of our model. Higher control frequencies can run at a cost of prediction horizon, such as T = 9 at 75 Hz or T = 6 at 100 Hz. The computational cost is proportional to the product of model size, number of actions (N), and time horizon (T). At high frequencies the time spanned by the dynamics model predictions shrinks because of a smaller dynamics step in prediction and by having less computation for longer T, limiting performance. At 50 Hz, a time horizon of 12 corresponds to a prediction of 240 ms into the future. Tuning the parameters of this methodology corresponds to changes in the likelihood of taking the best action, rather than modifying actuator responses, and therefore its effect on performance is less sensitive than changes to PID or standard controller parameters. At 50 Hz, the predictive power is strong, but the relatively low control frequency increases susceptibility to disturbances in between control updates. A system running with an Nvidia Titan Xp attains a maximum control frequency of 230 Hz with N = 5000, T = 1. For testing we use locked frequencies of 25 Hz and 50 Hz at N = 5000, T = 12.
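Combining the pieces, a simplified random-shooter MPC step might look like the sketch below. The state-vector indices, the value λ = 1, and the finite-difference stand-in for the Euler angle rates in Eq. (3) are our assumptions; the model is the ForwardDynamics sketch from Section IV and the candidates come from the sampling sketch above, converted to a float tensor.

```python
import torch

LAMBDA = 1.0  # assumed ratio between angle and angle-rate cost terms, Eq. (3)

def mpc_step(model, xi, candidates, horizon=12):
    """Pick the best of N constant candidate actions under the learned model.

    xi:         current length-52 input [s_t .. s_{t-3}, a_t .. a_{t-3}]
    candidates: (N, 4) float tensor of PWM 4-tuples
    """
    n = candidates.shape[0]
    hist = xi.repeat(n, 1)              # (N, 52): one rollout per candidate
    prev_angles = hist[:, 3:6]          # assumed layout: [φ, θ, ψ] at 3:6
    cost = torch.zeros(n)
    with torch.no_grad():
        for _ in range(horizon):
            mean, _ = model(hist)       # predicted change in state (mean only)
            state = hist[:, :9] + mean  # s_{t+1} = s_t + fθ(s_t, a_t)
            angles = state[:, 3:6]
            rates = angles - prev_angles  # finite-difference Euler angle rates
            # Eq. (3): λ(roll² + pitch²) plus squared Euler angle rates
            cost += LAMBDA * (angles[:, 0]**2 + angles[:, 1]**2) \
                    + (rates**2).sum(dim=1)
            prev_angles = angles
            # Slide the history window: newest state and (constant) action first.
            hist = torch.cat([state, hist[:, :27],
                              candidates, hist[:, 36:48]], dim=1)
    return candidates[cost.argmin()]
```

Because every candidate rollout is an independent batch row, the whole loop is a handful of batched matrix multiplies, which is what makes the GPU parallelization described above effective.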
VI. EXPERIMENTAL EVALUATION

We now describe the setting used in our experiments, the learning process of the system, and the performance summary of the control algorithm. Videos of the flying quadrotor, and full code for controlling the Crazyflie and reproducing the experiments, are available online at https://sites.google.com/berkeley.edu/mbrl-quadrotor/

A. Experimental Setting

The performance of our controller is measured by the average flight length over each roll-out. Failure is often due to drift-induced collisions, or, as in many earlier roll-outs, when flights reach a pitch or roll angle over 40°. In both cases, an emergency stop command is sent to the motors to minimize damage. Additionally, the simple on-board state estimator shows heavy inconsistencies on the Euler angles following a rapid throttle ramping, which is a potential limiting factor on the length of controlled flight. Notably, a quadrotor with internal PIDs enabled will still fail regularly due to drift on the same time frame as our controller; it is only with external inputs that the internal controllers will obtain substantially longer flights. The drift showcases the challenge of using attitude controllers to mitigate an offset in velocity.

Fig. 5. Mean and standard deviation of the 10 flights during each rollout learning at 25 Hz and 50 Hz. The 50 Hz shows a slight edge on final performance, but a much quicker learning ability per flight by having more action changes during control.

B. Learning Process

The learning process follows the RL framework of collecting data and iteratively updating the policy. We trained an initial model f0 on 124 and 394 points of dynamics data at 25 Hz and 50 Hz, respectively, from the Crazyflie being flown by a random action controller. Starting with this initial model as the MPC plant, the Crazyflie undertakes a series of autonomous flights from the ground with a 250 ms ramp-up, open-loop takeoff followed by on-policy control while logging data via radio. Each roll-out is a series of 10 flights, which causes large variances in flight time. The initial roll-outs have less control authority and inherently explore more extreme attitude orientations (often during crashes), which is valuable to future iterations that wish to recover from higher pitch and/or roll. The random and first three controlled roll-outs at 50 Hz are plotted in Figure 6 to show the rapid improvement of performance with little training data.

The full learning curves are shown in Figure 5. At both 25 Hz and 50 Hz the rate of flight improvement reaches its maximum once there are 1,000 trainable points for the dynamics model, which takes longer to collect at the lower control frequency. The improvement is after roll-out 1 at 50 Hz and roll-out 5 at 25 Hz. The longest individual flights at both control frequencies are over 5 s. The final models at 25 Hz and 50 Hz are trained on 2,608 and 9,655 points respectively, but peak performance is earlier due to dynamics model convergence and hardware lifetime limitations.
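In outline, the roll-out procedure above reduces to a short loop. The sketch below is schematic, with the training and flight steps injected as callables rather than our exact code:

```python
def run_learning(train, fly, seed_data, n_rollouts=12, flights_per_rollout=10):
    """Iterative MBRL loop of Section VI-B. `train` fits a dynamics model to
    all collected (state, action, next-state) tuples; `fly` performs one
    250 ms open-loop takeoff followed by on-policy MPC control and returns
    the data logged over radio. `seed_data` comes from random-action flights
    (124 points at 25 Hz, 394 at 50 Hz in our experiments)."""
    data = list(seed_data)
    model = train(data)                       # initial model f0
    for _ in range(n_rollouts):
        for _ in range(flights_per_rollout):  # each roll-out is 10 flights
            data += fly(model)
        model = train(data)                   # retrain between roll-outs
    return model
```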


Fig. 6. The pitch over time for each flight in the first four roll-outs of learning at 50 Hz, showing the rapid increase in control ability on limited data. The random and first controlled roll-outs show little ability, but roll-out 3 is already flying for > 2 seconds.

C. Performance Summary

This controller demonstrates the ability to hover, following a "clean" open-loop takeoff, for multiple seconds (an example is shown in Figure 8). At both 25 Hz and 50 Hz, once reaching maximum performance in the 12 roll-outs, about 30% of flights fail due to drift. The failures due to drift indicate the full potential of the MBRL solution to low-level quadrotor control. An example of a test flight segment is shown in Figure 7, where the control response to pitch and roll error is visible.

Fig. 7. The performance of the 50 Hz controller. (Above) The controlled PWM values over time, which visibly change in response to angle oscillations. (Below) Pitch and roll.

Typical quadrotor controllers, the basis of comparison, achieve better performance, but with higher control frequencies and engineering design iterations leveraging system dynamics knowledge. With the continued improvement of computational power, the performance of this method should be re-characterized as potential control frequencies approach that of PID controllers. Beyond comparison to PID controllers with low computational footprints, the results warrant exploration of MBRL for new dynamical systems, or when varying goals need to be built into low-level control. In less than 10 minutes of clock time, and only 3 minutes of training data, we present comparable, but limited, performance that is encouraging for future abilities to match and surpass basic controllers. Moving the balance of this work further towards domain-specific control would likely improve performance, but the broad potential for applications to more and different robotic platforms compels exciting future use of MBRL.

VII. DISCUSSION AND LIMITATIONS

The system has multiple factors contributing to the short length and high variance of flights. First, the PWM equilibrium values of the motors shift by over 10% following a collision, causing the true dynamics model to shift over time. This problem is partially mitigated by replacing the components of the Crazyflie, but any change of hardware causes dynamics model mismatch and the challenge persists. Additionally, the internal state estimator does not track extreme changes in Euler angles accurately. We believe that overcoming the system-level and dynamical limitations of controlling the Crazyflie in this manner showcases the expressive power of MBRL.

Improvements to the peak performance will come by identifying causes of the performance plateau. Elements to investigate include the data-limited slow-down in improvement of the dynamics model accuracy, the different collected data distributions at each roll-out, the stochasticity of NN training, and the stochasticity at running time with MPC.

Beyond improving performance, computational burden and safety hinder the applicability of MBRL with MPC to more systems. The current method requires a GPU-enabled base-station, but the computational efficiency could be improved with intelligent action sampling methods or by combining model-free techniques, such as learning a deterministic action policy based on the learned dynamics model. We are exploring methods to generate NN control policies, such as an imitative-MPC network or a model-free variant, on the dynamics model that could reduce computation by over 1000x by only evaluating a NN once per state measurement. In order to enhance safety, we are interested in defining safety constraints within the model predictive controller, rather than just a safety kill-switch in firmware, opening the door to fully autonomous learned control from start to finish.

VIII. CONCLUSIONS AND FUTURE WORK

This work is an exploration of the capabilities of model-based reinforcement learning for low-level control of an a priori unknown dynamic system. The results, with the added challenges of the static instability and fast dynamics of the Crazyflie, show the capabilities and future potential of MBRL. We detail the firmware modifications, system design, and model learning considerations required to enable the use of an MBRL-based MPC system for quadrotor control over radio. We removed all robot-specific transforms and higher-level commands to only design the controller on top of a learned dynamics model to accomplish a simple task. The controller shows the capability to hover for multiple seconds at a time with less than 3 minutes of collected data – approximately half of the full battery-life flight time of a Crazyflie. With learned flight in only minutes of testing, this brand of system-agnostic MBRL is an exciting solution not only due to its generalizability, but also due to its learning speed.

Fig. 8. A full flight of Euler angle state data with frames of the corresponding video. This flight would have continued longer if not for drifting into the wall. The
relation between physical orientation and pitch and roll is visible in the frames. The full video is online on the accompanying website.

In parallel with addressing the limitations outlined in Section VII, the quadrotor results warrant investigation into low-level control of other robots. The emergent area of microrobotics combines the issues of under-characterized dynamics, weak or non-existent controllers, "fast" dynamics and therefore instabilities, and high cost-to-test [29], [30], so it is a strong candidate for MBRL experiments.

ACKNOWLEDGMENT

The authors would like to thank the UC Berkeley Sensor & Actuator Center (BSAC), Berkeley DeepDrive, and Nvidia Inc.

REFERENCES

[1] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, "Gaussian processes for data-efficient learning in robotics and control," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 408–423, Feb. 2015.
[2] G. Williams et al., "Information theoretic MPC for model-based reinforcement learning," in Proc. Int. Conf. Robot. Autom., 2017, pp. 1714–1721.
[3] K. Chua, R. Calandra, R. McAllister, and S. Levine, "Deep reinforcement learning in a handful of trials using probabilistic dynamics models," in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 4759–4770.
[4] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter, "Control of a quadrotor with reinforcement learning," IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 2096–2103, Oct. 2017.
[5] S. Bansal, A. K. Akametalu, F. J. Jiang, F. Laine, and C. J. Tomlin, "Learning quadrotor dynamics using neural network for flight control," in Proc. IEEE Conf. Decis. Control, 2016, pp. 4653–4660.
[6] R. Mahony, V. Kumar, and P. Corke, "Multirotor aerial vehicles," IEEE Robot. Autom. Mag., vol. 19, no. 3, pp. 20–32, Sep. 2012.
[7] D. Mellinger, N. Michael, and V. Kumar, "Trajectory generation and control for precise aggressive maneuvers with quadrotors," Int. J. Robot. Res., vol. 31, no. 5, pp. 664–674, 2012.
[8] W. Zhang, M. W. Mueller, and R. D'Andrea, "A controllable flying vehicle with a single moving part," in Proc. IEEE Int. Conf. Robot. Autom., 2016, pp. 3275–3281.
[9] M. W. Mueller and R. D'Andrea, "Stability and control of a quadrocopter despite the complete loss of one, two, or three propellers," in Proc. IEEE Int. Conf. Robot. Autom., 2014, pp. 45–52.
[10] M. Ryll, H. H. Bülthoff, and P. R. Giordano, "Modeling and control of a quadrotor UAV with tilting propellers," in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 4606–4613.
[11] H. Liu, D. Li, J. Xi, and Y. Zhong, "Robust attitude controller design for miniature quadrotors," Int. J. Robust Nonlinear Control, vol. 26, no. 4, pp. 681–696, 2016.
[12] M. Bangura and R. Mahony, "Real-time model predictive control for quadrotors," Int. Fed. Autom. Control Proc. Vol., vol. 47, pp. 11773–11780, 2014.
[13] M. Abdolhosseini, Y. Zhang, and C. A. Rabbath, "An efficient model predictive control scheme for an unmanned quadrotor helicopter," J. Intell. Robot. Syst., vol. 70, no. 1–4, pp. 27–38, 2013.
[14] A. P. Schoellig, F. L. Mueller, and R. D'Andrea, "Optimization-based iterative learning for precise quadrocopter trajectory tracking," Auton. Robots, vol. 33, no. 1/2, pp. 103–127, 2012.
[15] C. Sferrazza, M. Muehlebach, and R. D'Andrea, "Trajectory tracking and iterative learning on an unmanned aerial vehicle using parametrized model predictive control," in Proc. IEEE Conf. Decis. Control, 2017, pp. 5186–5192.
[16] P. Bouffard, A. Aswani, and C. Tomlin, "Learning-based model predictive control on a quadrotor: Onboard implementation and experimental results," in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 279–284.
[17] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, "Learning-based model predictive control for safe exploration," in Proc. IEEE Conf. Decis. Control, 2018, pp. 6059–6066.
[18] F. Berkenkamp, A. P. Schoellig, and A. Krause, "Safe controller optimization for quadrotors with Gaussian processes," in Proc. IEEE Int. Conf. Robot. Autom., 2016, pp. 491–496.
[19] S. Bansal, R. Calandra, T. Xiao, S. Levine, and C. J. Tomlin, "Goal-driven dynamics learning via Bayesian optimization," in Proc. IEEE Conf. Decis. Control, 2017, pp. 5168–5173.
[20] I. Clavera, A. Nagabandi, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn, "Learning to adapt: Meta-learning for model-based control," 2018, arXiv:1803.11347.
[21] A. Kupcsik, M. P. Deisenroth, J. Peters, A. P. Loh, P. Vadakkepat, and G. Neumann, "Model-based contextual policy search for data-efficient generalization of robot skills," Artif. Intell., vol. 247, pp. 415–439, 2017.
[22] P. Abbeel, Apprenticeship Learning and Reinforcement Learning With Application to Robotic Control. Stanford, CA, USA: Stanford Univ., 2008.
[23] A. Punjani and P. Abbeel, "Deep learning helicopter dynamics models," in Proc. IEEE Int. Conf. Robot. Autom., May 2015, pp. 3223–3230.
[24] A. Nagabandi et al., "Learning image-conditioned dynamics models for control of underactuated legged millirobots," in Proc. IEEE Int. Conf. Intell. Robots Syst., 2018, pp. 4606–4613.
[25] A. Bitcraze, "Crazyflie 2.0," 2016. [Online]. Available: https://www.bitcraze.io/crazyflie-2/
[26] W. Hönig and N. Ayanian, "Flying multiple UAVs using ROS," in Robot Operating System. New York, NY, USA: Springer, 2017, pp. 83–118.
[27] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[28] P. Ramachandran, B. Zoph, and Q. V. Le, "Swish: A self-gated activation function," 2017, arXiv:1710.05941.
[29] D. S. Drew, N. O. Lambert, C. B. Schindler, and K. S. Pister, "Toward controlled flight of the ionocraft: A flying microrobot using electrohydrodynamic thrust with onboard sensing and no moving parts," IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 2807–2813, Oct. 2018.
[30] D. S. Contreras, D. S. Drew, and K. S. Pister, "First steps of a millimeter-scale walking silicon robot," in Proc. Int. Conf. Solid-State Sensors, Actuators Microsyst., 2017, pp. 910–913.
