LAMBERT et al.: LOW-LEVEL CONTROL OF A QUADROTOR WITH DEEP MODEL-BASED REINFORCEMENT LEARNING 4225
shooter’ MPC, which can be efficiently parallelized on a graphics processing unit (GPU) to execute low-level, real-time control. Using MBRL, we demonstrate controlled hover of a Crazyflie via on-board sensor measurements and application of pulse width modulation (PWM) motor voltage signals. Our method for quickly learning controllers from real-world data is not yet an alternative to traditional controllers such as PID, but it opens important avenues of research. The general mapping of the forward dynamics model, in theory, allows the model to be used for control tasks beyond attitude control. Additionally, we highlight the capability of leveraging predictive models learned on extremely little data to operate at frequencies ≤50 Hz, while a hand-tuned PID controller at this frequency failed to hover the Crazyflie. With the benefits outlined, the current MBRL approach has limitations in performance and applicability to our goal of use with other robots. The performance in this letter has notable room for improvement by mitigating drift. Future applications are limited by our approach’s requirement of a high-power external GPU – a prohibitively large computational footprint when compared to standard low-level controllers – and by the method’s potential for collisions when learning.

The resulting system achieves repeated stable hover of up to 6 seconds, with failures due to drift of unobserved states, within 3 minutes of fully-autonomous training data. These results demonstrate the ability of MBRL to control robotic systems in the absence of a priori knowledge of dynamics, pre-configured internal controllers for stability or actuator response smoothing, and expert demonstration.

II. RELATED WORK

A. Attitude and Hover Control of Quadrotors

Classical controllers (e.g., PID, LQR, iLQR) in conjunction with analytic models for the rigid-body dynamics of a quadrotor are often sufficient to control vehicle attitude [6]. In addition, linearized models are sufficient to simultaneously control for global trajectory attitude setpoints using well-tuned nested PID controllers [7]. Standard control approaches show impressive acrobatic performance with quadrotors, but we note that we are not interested in comparing our approach to finely-tuned performance; the goal of using MBRL in this context is to highlight a solution that automatically generates a functional controller in less or equal time than initial PID hand-tuning, with no foundation of dynamics knowledge.

Research focusing on developing novel low-level attitude controllers shows functionality in extreme nonlinear cases, such as for quadrotors with a missing propeller [8], with multiple damaged propellers [9], or with the capability to dynamically tilt their propellers [10]. Optimal control schemes have demonstrated results on standard quadrotors with extreme precision and robustness [11].

Our work differs by specifically demonstrating the possibility of attitude control via real-time external MPC. Unlike other work on real-time MPC for quadrotors, which focuses on trajectory control [12], [13], ours uses a dynamics model derived fully from in-flight data that takes motor signals as direct inputs. Effectively, our model encompasses only the actual dynamics of the system, while other implementations learn dynamics conditioned on previously existing internal controllers. The general nature of our model, from sensors to actuators, demonstrates the potential for use on robots with no previous controller — we only use the quadrotor as the basis for comparison and do not expect it to be the limit of the MBRL system’s functionality.

B. Learning for Quadrotors

Although learning-based approaches have been widely applied for trajectory control of quadrotors, implementations typically rely on sending controller outputs as setpoints to stable on-board attitude and thrust controllers. Iterative learning control (ILC) approaches [14], [15] have demonstrated robust control of quadrotor flight trajectories but require these on-board controllers for attitude setpoints. Learning-based model predictive control implementations, which successfully track trajectories, also wrap their control around on-board attitude controllers by directly sending Euler angle or thrust commands [16], [17]. Gaussian process-based automatic tuning of position controller gains has been demonstrated [18], but only in parallel with on-board controllers tuned separately.

Model-free reinforcement learning has been shown to generate control policies for quadrotors that out-perform linear MPC [4]. Although similarly motivated by a desire to generate a control policy acting directly on actuator inputs, that work used an external vision system for state error correction, operated with an internal motor speed controller enabled (i.e., thrusts were commanded rather than motor voltages), and generated a large fraction of its data in simulation.

Researchers of system identification for quadrotors also apply machine learning techniques. Bansal et al. used NN models of the Crazyflie’s dynamics to plan trajectories [5]. Our implementation differs by directly predicting the change in attitude from on-board IMU measurements and motor voltages, rather than predicting from global, motion-capture state measurements and thrust targets for the internal PIDs. Using Bayesian optimization to learn a linearized quadrotor dynamics model has demonstrated capabilities for tuning an optimal control scheme [19]. While this approach is data-efficient and is shown to outperform analytic models, the model learned is task-dependent. Our MBRL approach is task-agnostic, requiring only a change in objective function and no new dynamics data for a new task.

C. Model-Based Reinforcement Learning

Functionality of MBRL is evident in simulation for multiple tasks in low-data regimes, including quadrupeds [20] and manipulation tasks [21]. Low-level MBRL control (i.e., with direct motor input signals) of an RC car has been demonstrated experimentally, but the system is of lower dimensionality and has static stability [22]. Relatively low-level control (i.e., mostly thrust commands only, passed through an internal governor before conversion to motor signals) of an autonomous helicopter has been demonstrated, but required a ground-based vision system for error correction in state estimates as well as expert demonstration for model training [22].

Properly optimized NNs trained on experimental data show test error below common analytic dynamics models for flying vehicles, but the models did not include direct actuator
4226 IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 4, NO. 4, OCTOBER 2019
Fig. 3. The NN dynamics model predicts the mean and variance of the change in state given the past 4 state-action pairs. We use 2 hidden layers of width 250 neurons.

Fig. 4. Predicted states for N = 50 candidate actions, with the chosen “best action” highlighted in red. The predicted state evolution is expected to diverge from the ground truth for future t because actions are re-planned at every step.

design. All layers except for the output layer use the Swish activation function [28] with parameter β = 1. The network structure was cross-validated offline for prediction accuracy versus potential control frequency. Initial validation of training parameters was done on early experiments, and the final values are held constant for each rollout in the experiments reported in Section VI. The validation set is a random subset of measured (s_t, a_t, s_{t+1}) tuples in the pruned data.

Additional dynamics model accuracy could be gained with systematic model verification between rollouts, but experimental variation in the current setup would limit empirical insight, and a lower model loss does not guarantee improved flight time. Our initial experiments indicate improved flight performance with forward dynamics models minimizing the mean and variance of state predictions versus models minimizing mean squared prediction error, but more experiments are needed to state clear relationships between model parameters and flight performance.

Training a probabilistic NN to approximate the dynamics model requires pruning of logged data (e.g., dropped packets) and scaling of variables to assist model convergence. Our state s_t is the vector of Euler angles (yaw, pitch, and roll), linear accelerations, and angular accelerations, reading

s_t = [ω̇_x, ω̇_y, ω̇_z, φ, θ, ψ, ẍ, ÿ, z̈]^T. (2)

The Euler angles are from an internal complementary filter, while the linear and angular accelerations are measured directly from the on-board MPU-9250 9-axis IMU. In practice, for predicting across longer time horizons, modeling acceleration values as a global next state rather than a change in state increased the length of the time horizon in composed predictions before the models diverged. While the change-in-Euler-angle predictions are stable, the changes in raw accelerations vary widely with sensor noise and cause non-physical dynamics predictions, so all the linear and angular accelerations are trained to fit the global next state.

We combine the state data with the four PWM values, a_t = [m1, m2, m3, m4]^T, to get the system information at time t. The NNs are cross-validated to confirm that using all state data (i.e., including the relatively noisy raw measurements) improves prediction accuracy in the change in state.

While the dynamics of a quadrotor are often represented as a linear system, for a Micro Air Vehicle (MAV) at high control frequencies motor step response and thrust asymmetry heavily impact the change in state, resulting in a heavily nonlinear dynamics model. The step response of a Crazyflie motor RPM from PWM 0 to max or from max to 0 is on the order of 250 ms, so our update time-step of 20 ms is short enough for motor spin-up to contribute to learned dynamics. To account for spin-up, we append past system information to the current state and PWMs to generate an input into the NN model that includes past time. From the exponential step response and with a bounded possible PWM value within p_eq ± 5000, the motors need approximately 25 ms to reach the desired rotor speed; when operating at 50 Hz, the time step between updates is 20 ms, leading us to an appended state and PWM history of length 4. This state-action history length was validated as having the lowest test error on our data-set (lengths 1 to 10 evaluated). This yields the final input of length 52 to our NN, ξ, with states and actions combined to

ξ_t = [s_t s_{t−1} s_{t−2} s_{t−3} a_t a_{t−1} a_{t−2} a_{t−3}]^T.

V. LOW-LEVEL MODEL-BASED CONTROL

This section explains how we incorporate our learned forward dynamics model into a functional controller. The dynamics model is used for control by predicting the state evolution given a certain action, and the MPC provides a framework for evaluating many action candidates simultaneously. We employ a ‘random shooter’ MPC, where a set of N randomly generated actions are simulated over a time horizon T. The best action is decided by a user-designed objective function that takes in the simulated trajectories X̂(a, s_t) and returns a best action, a*, as visualized in Figure 4. The objective function minimizes the receding-horizon cost of each state from the end of the prediction window to the current measurement.

The candidate actions, {a_i = (a_{i,1}, a_{i,2}, a_{i,3}, a_{i,4})}_{i=1}^{N}, are 4-tuples of motor PWM values centered around the stable hover-point for the Crazyflie. The candidate actions are constant across the prediction time horizon T. For a single sample a_i, each a_{i,j} is chosen from a uniform random variable on the interval [p_{eq,j} − σ, p_{eq,j} + σ], where p_{eq,j} is the equilibrium PWM value for motor j. The range of the uniform distribution is controlled by the tuned parameter σ; this has the effect of restricting the variety of actions the Crazyflie can take. For the given range of PWM values for each motor, [p_eq − σ, p_eq + σ], we discretize the candidate PWM values to a step size of 256 to match the future compression into a radio packet. This discretization of available action choices increases the coverage of the candidate action space. The compression of PWM resolution, while helpful
Our MPC operates on a time horizon T = 12 to leverage the predictive power of our model. Higher control frequencies can run at a cost of prediction horizon, such as T = 9 at 75 Hz or T = 6 at 100 Hz. The computational cost is proportional to the product of model size, number of actions (N), and time horizon (T). At high frequencies the time spanned by the dynamics model predictions shrinks, both because of a smaller dynamics step per prediction and because less computation is available for longer T, limiting performance. At 50 Hz, a time horizon of 12 corresponds to a prediction of 240 ms into the future. Tuning the parameters of this methodology corresponds to changes in the likelihood of taking the best action, rather than modifying actuator responses, and therefore its effect on performance is less sensitive than changes to PID or standard controller parameters. At 50 Hz, the predictive power is strong, but the relatively low control frequency increases susceptibility to disturbances in between control updates. A system running with an Nvidia Titan Xp attains a maximum control frequency of 230 Hz with N = 5000, T = 1. For testing we use locked frequencies of 25 Hz and 50 Hz at N = 5000, T = 12.

VI. EXPERIMENTAL EVALUATION

We now describe the setting used in our experiments, the learning process of the system, and the performance summary of the control algorithm. Videos of the flying quadrotor, and full code for controlling the Crazyflie and reproducing the experiments, are available online at https://sites.google.com/berkeley.edu/mbrl-quadrotor/

The drift showcases the challenge of using attitude controllers to mitigate an offset in velocity.

B. Learning Process

The learning process follows the RL framework of collecting data and iteratively updating the policy. We trained an initial model f0 on 124 and 394 points of dynamics data at 25 Hz and 50 Hz, respectively, from the Crazyflie being flown by a random-action controller. Starting with this initial model as the MPC plant, the Crazyflie undertakes a series of autonomous flights from the ground with a 250 ms ramp-up, open-loop takeoff followed by on-policy control while logging data via radio. Each roll-out is a series of 10 flights, which causes large variances in flight time. The initial roll-outs have less control authority and inherently explore more extreme attitude orientations (often during crashes), which is valuable to future iterations that wish to recover from higher pitch and/or roll. The random and first three controlled roll-outs at 50 Hz are plotted in Figure 6 to show the rapid improvement of performance with little training data.

The full learning curves are shown in Figure 5. At both 25 Hz and 50 Hz the rate of flight improvement reaches its maximum once there are 1,000 trainable points for the dynamics model, which takes longer to collect at the lower control frequency. The improvement is after roll-out 1 at 50 Hz and roll-out 5 at 25 Hz. The longest individual flights at both control frequencies are over 5 s. The final models at 25 Hz and 50 Hz are trained on 2,608 and 9,655 points, respectively, but peak performance is earlier due to dynamics model convergence and hardware lifetime limitations.
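As a rough illustration, the roll-out/retrain cycle described above can be sketched as follows. This is not the authors' released code: `fly_rollout` and `train_model` are hypothetical stand-ins for the radio-logged flights and the NN training step, and the roll-out counts are parameters.

```python
def mbrl_hover_training(fly_rollout, train_model, n_rollouts=6,
                        flights_per_rollout=10):
    """Sketch of the iterative learning process (Section VI-B).

    fly_rollout(policy) -> list of (state, action, next_state) tuples
    from one flight; policy=None means the random-action controller.
    train_model(data) -> forward dynamics model fit on all data so far.
    Both callables are illustrative stand-ins.
    """
    data = []
    # Roll-out 0: random-action flights seed the initial model f0.
    for _ in range(flights_per_rollout):
        data += fly_rollout(policy=None)
    model = train_model(data)
    # Subsequent roll-outs fly on-policy under MPC, then retrain on
    # the aggregated data set.
    for _ in range(n_rollouts):
        for _ in range(flights_per_rollout):
            data += fly_rollout(policy=model)
        model = train_model(data)
    return model, len(data)
```

The key design point this mirrors is data aggregation: every retraining pass uses all flights collected so far, so early crash data (extreme attitudes) remains available to later models.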
Fig. 6. The pitch over time for each flight in the first four roll-outs of learning at 50 Hz, showing the rapid increase in control ability on limited data. The random
and first controlled roll-out show little ability, but roll-out 3 is already flying for > 2 seconds.
Fig. 8. A full flight of Euler angle state data with frames of the corresponding video. This flight would have continued longer if not for drifting into the wall. The
relation between physical orientation and pitch and roll is visible in the frames. The full video is online on the accompanying website.
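The random-shooter MPC of Section V can be sketched as follows. This is a simplified illustration under stated assumptions: the dynamics model is a caller-supplied stand-in for the NN, `attitude_cost` is a hypothetical objective, and the index positions of pitch and roll in the state vector are assumed.

```python
import numpy as np

def attitude_cost(s):
    # Hypothetical objective: penalize pitch and roll (indices assumed
    # from the state layout in Eq. (2); the paper's cost may differ).
    pitch, roll = s[4], s[5]
    return pitch ** 2 + roll ** 2

def random_shooter_mpc(model, s_hist, a_hist, p_eq, sigma=5000.0,
                       N=5000, T=12, step=256, rng=None):
    """Pick the best constant action over the horizon T.

    model(s_hist, a_hist, a) -> predicted next state (stand-in for the
    NN dynamics model with its state/action history input).
    p_eq: equilibrium PWM per motor, shape (4,).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample N candidate 4-tuples uniformly in [p_eq - sigma, p_eq + sigma],
    # then snap to the 256-step grid used for radio-packet compression.
    cand = rng.uniform(p_eq - sigma, p_eq + sigma, size=(N, 4))
    cand = np.round(cand / step) * step
    costs = np.zeros(N)
    for i in range(N):
        s_h, a_h = list(s_hist), list(a_hist)
        for _ in range(T):  # propagate each constant action over the horizon
            s_next = model(s_h, a_h, cand[i])
            costs[i] += attitude_cost(s_next)   # receding-horizon cost
            s_h = s_h[1:] + [s_next]            # slide state history
            a_h = a_h[1:] + [cand[i]]           # slide action history
    return cand[np.argmin(costs)]               # best action a*
```

Because each candidate is independent, the inner rollouts can be batched as one forward pass of shape (N, 52) per horizon step, which is what makes the GPU parallelization practical.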
In parallel with addressing the limitations outlined in Section VII, the quadrotor results warrant investigation into low-level control of other robots. The emergent area of microrobotics combines the issues of under-characterized dynamics, weak or non-existent controllers, “fast” dynamics and therefore instabilities, and high cost-to-test [29], [30], so it is a strong candidate for MBRL experiments.

ACKNOWLEDGMENT

The authors would like to thank the UC Berkeley Sensor & Actuator Center (BSAC), Berkeley DeepDrive, and Nvidia Inc.

REFERENCES

[1] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, “Gaussian processes for data-efficient learning in robotics and control,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 408–423, Feb. 2015.
[2] G. Williams et al., “Information theoretic MPC for model-based reinforcement learning,” in Proc. Int. Conf. Robot. Autom., 2017, pp. 1714–1721.
[3] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 4759–4770.
[4] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter, “Control of a quadrotor with reinforcement learning,” IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 2096–2103, Oct. 2017.
[5] S. Bansal, A. K. Akametalu, F. J. Jiang, F. Laine, and C. J. Tomlin, “Learning quadrotor dynamics using neural network for flight control,” in Proc. IEEE Conf. Decis. Control, 2016, pp. 4653–4660.
[6] R. Mahony, V. Kumar, and P. Corke, “Multirotor aerial vehicles,” IEEE Robot. Autom. Mag., vol. 19, no. 3, pp. 20–32, Sep. 2012.
[7] D. Mellinger, N. Michael, and V. Kumar, “Trajectory generation and control for precise aggressive maneuvers with quadrotors,” Int. J. Robot. Res., vol. 31, no. 5, pp. 664–674, 2012.
[8] W. Zhang, M. W. Mueller, and R. D’Andrea, “A controllable flying vehicle with a single moving part,” in Proc. IEEE Int. Conf. Robot. Autom., 2016, pp. 3275–3281.
[9] M. W. Mueller and R. D’Andrea, “Stability and control of a quadrocopter despite the complete loss of one, two, or three propellers,” in Proc. IEEE Int. Conf. Robot. Autom., 2014, pp. 45–52.
[10] M. Ryll, H. H. Bülthoff, and P. R. Giordano, “Modeling and control of a quadrotor UAV with tilting propellers,” in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 4606–4613.
[11] H. Liu, D. Li, J. Xi, and Y. Zhong, “Robust attitude controller design for miniature quadrotors,” Int. J. Robust Nonlinear Control, vol. 26, no. 4, pp. 681–696, 2016.
[12] M. Bangura and R. Mahony, “Real-time model predictive control for quadrotors,” Int. Fed. Autom. Control Proc. Vol., vol. 47, pp. 11773–11780, 2014.
[13] M. Abdolhosseini, Y. Zhang, and C. A. Rabbath, “An efficient model predictive control scheme for an unmanned quadrotor helicopter,” J. Intell. Robot. Syst., vol. 70, no. 1–4, pp. 27–38, 2013.
[14] A. P. Schoellig, F. L. Mueller, and R. D’Andrea, “Optimization-based iterative learning for precise quadrocopter trajectory tracking,” Auton. Robots, vol. 33, no. 1/2, pp. 103–127, 2012.
[15] C. Sferrazza, M. Muehlebach, and R. D’Andrea, “Trajectory tracking and iterative learning on an unmanned aerial vehicle using parametrized model predictive control,” in Proc. IEEE Conf. Decis. Control, 2017, pp. 5186–5192.
[16] P. Bouffard, A. Aswani, and C. Tomlin, “Learning-based model predictive control on a quadrotor: Onboard implementation and experimental results,” in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 279–284.
[17] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based model predictive control for safe exploration,” in Proc. IEEE Conf. Decis. Control, 2018, pp. 6059–6066.
[18] F. Berkenkamp, A. P. Schoellig, and A. Krause, “Safe controller optimization for quadrotors with Gaussian processes,” in Proc. IEEE Int. Conf. Robot. Autom., 2016, pp. 491–496.
[19] S. Bansal, R. Calandra, T. Xiao, S. Levine, and C. J. Tomlin, “Goal-driven dynamics learning via Bayesian optimization,” in Proc. IEEE Conf. Decis. Control, 2017, pp. 5168–5173.
[20] I. Clavera, A. Nagabandi, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn, “Learning to adapt: Meta-learning for model-based control,” 2018, arXiv:1803.11347.
[21] A. Kupcsik, M. P. Deisenroth, J. Peters, A. P. Loh, P. Vadakkepat, and G. Neumann, “Model-based contextual policy search for data-efficient generalization of robot skills,” Artif. Intell., vol. 247, pp. 415–439, 2017.
[22] P. Abbeel, Apprenticeship Learning and Reinforcement Learning With Application to Robotic Control. Stanford, CA, USA: Stanford Univ., 2008.
[23] A. Punjani and P. Abbeel, “Deep learning helicopter dynamics models,” in Proc. IEEE Int. Conf. Robot. Autom., May 2015, pp. 3223–3230.
[24] A. Nagabandi et al., “Learning image-conditioned dynamics models for control of underactuated legged millirobots,” in Proc. IEEE Int. Conf. Intell. Robots Syst., 2018, pp. 4606–4613.
[25] A. Bitcraze, “Crazyflie 2.0,” 2016. [Online]. Available: https://www.bitcraze.io/crazyflie-2/
[26] W. Hönig and N. Ayanian, “Flying multiple UAVs using ROS,” in Robot Operating System. New York, NY, USA: Springer, 2017, pp. 83–118.
[27] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980.
[28] P. Ramachandran, B. Zoph, and Q. V. Le, “Swish: A self-gated activation function,” 2017, arXiv:1710.05941.
[29] D. S. Drew, N. O. Lambert, C. B. Schindler, and K. S. Pister, “Toward controlled flight of the ionocraft: A flying microrobot using electrohydrodynamic thrust with onboard sensing and no moving parts,” IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 2807–2813, Oct. 2018.
[30] D. S. Contreras, D. S. Drew, and K. S. Pister, “First steps of a millimeter-scale walking silicon robot,” in Proc. Int. Conf. Solid-State Sensors, Actuators Microsyst., 2017, pp. 910–913.