Robot Operating System 2 - Design and Architecture - 2024
Lightweight Communications and Marshalling (LCM) is a middleware that uses a publish/subscribe model with bindings in many languages. It concentrates on handling messaging and data marshaling in high-bandwidth low-latency environments (20). This limits the range of robotic applications for which LCM can be effectively used. Open Robot Control Software (OROCOS) is a set of libraries for robot control, focused on real-time control systems and related topics, such as computing kinematic chains and Bayesian filtering (21). The project has grown into a full framework integrating the Common Object Request Broker Architecture (CORBA) middleware and tooling for deterministic computation in real-time applications. The LCM and OROCOS frameworks each concentrate on smaller pieces of the overall system, with a nontrivial proportion of the overall robotics problem left to the end-user.
ROS 1 contains a set of libraries that are useful when building many kinds of robots (1). There are utilities for monitoring processes, introspecting communications, receiving time-series transformations, and more. ROS 1 also has a large ecosystem of sensor, control, and algorithmic packages made available by community contributions, enabling a small team to build complex robotics applications. Although ROS 1 solves many of the complexity issues inherent to
via the SROS project. Although successful, it was difficult to maintain and needed further development to meet security trends. These are just two of the many attempts to patch ROS 1, which extended its useful lifetime but did not solve its core limitations.
ROS 2
ROS 2 is a software platform for developing robotics applications, also known as a robotics software development kit (SDK). Importantly, ROS 2 is open source and distributed under the Apache 2.0 License, which grants users broad rights to modify, apply, and redistribute the software, with no obligation to contribute back (22). ROS 2 relies on a federated ecosystem in which contributors are encouraged to create and release their own software. Most additional packages also use the Apache 2.0 License or similar. Making code free is fundamental to driving mass adoption: it allows users to leverage ROS 2 without constraining how they use or distribute their applications.
Scope
ROS 2 supports a broad range of robotics applications, from education and research to product development and deployment. It comprises
With this approach, an application can work across the multiple time domains that arise from combining physical devices with a host of software components, each of which may have its own frequency for providing data, accepting commands, or signaling events.
Modularity. The UNIX design goal to “make each program do one thing well” is mirrored in ROS 2 (26). Modularity is enforced at multiple levels, across library APIs, message definitions, command-line tools, and even the software ecosystem itself. The ecosystem is organized into a large number of federated packages, as opposed to a single codebase.
We do not pretend that these design principles are universal and without trade-offs. For example, asynchrony can make it more difficult to achieve deterministic execution. For any single, well-defined problem, it is possible to construct a special-purpose monolithic solution that is more computationally efficient because it does not involve abstractions or distributed communication.
However, after a decade of experience with the ROS 1 project, we claim that adherence to these principles will generally lead to better outcomes. This approach facilitates code reuse, software testing, fault isolation, collaboration within interdisciplinary project teams, and cooperation at a global scale.
Communication patterns
The ROS 2 APIs provide access to communication patterns. These are notably topics, services, and actions that are organized under the concept of a node. ROS 2 also provides APIs for parameters, timers, launch, and other auxiliary tools that can be used to design a robotic system.
Topics
The most common pattern that users will interact with is topics, which are an asynchronous message-passing framework. This is similar to other asynchronous frameworks, such as ASIO (36). ROS 2 provides the same publish-subscribe functionality but focuses on using asynchronous messaging to organize a system using strongly typed interfaces. It does so by organizing end points in a computational graph under the concept of a node. The node is an important organizational unit that allows a user to reason about a complex system, as shown in Fig. 1.
The anonymous publish-subscribe architecture allows many-to-many communication, which is advantageous for system introspection. A developer may observe any messages passing on a topic by creating a subscription to that topic, without any changes to the rest of the system.
Services
system and the processes that led to it. To that end, a three-part approach is continuously executed to measure and expose software quality:
1) Design documentation: Before a major addition, a written rationale and design for the work must be established. This documentation manifests as a design article or a ROS Enhancement Proposal (REP) (38, 39). At the time of writing, there are 44 design articles and seven REPs documenting the design of ROS 2.
2) Testing: Each feature in ROS 2 requires tests to ensure that it behaves correctly. Those tests are executed regularly in continuous integration. A combination of unit and integration tests is deployed, as well as a suite of static analysis tools (“linters”). At the time of writing, 32,000 to 33,000 tests are run on ROS 2, including 13 linters.
3) Quality declaration: Not every ROS 2 package needs to be rigorously documented and tested. Thus, a multilevel quality policy is defined (40). This policy defines the requirements for each quality level in terms of development practices, test coverage, security, and more. At the time of writing, 45 ROS 2 packages have achieved the highest level, quality level 1.
ROS 2 does not struggle in these situations. DDS delivers data over UDP, which does not itself attempt to retransmit data. Instead, DDS decides when and how to retransmit in unreliable conditions. DDS introduces quality of service (QoS) to expose these settings so that they can be tuned for the available bandwidth and latency.
The reliability setting determines whether message delivery is guaranteed. Using “best-effort,” the publisher will attempt to deliver the message once, which is useful when new data will make the old obsolete (e.g., sensor data). Set to “reliable,” the publisher will continue to send data until the receiver acknowledges receipt.
The durability QoS setting determines the persistence of a message. “Volatile” messages will be forgotten once they have been sent. Meanwhile, “transient-local” will store data and send it to late-joining subscriptions as necessary.
A connection’s history determines the behavior when the network cannot keep up with the data. Set to “keep-all,” all data are retained until the application consumes them. Most applications use “keep-last,” which retains a fixed-size queue of data, overwriting the oldest as needed. Other settings, including deadline, life span, liveliness, and lease duration, help in designing real-time systems.
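The history and durability settings described above can be illustrated with a minimal pure-Python sketch. It is not the ROS 2 or DDS API: the names `QoSProfile`, `Subscription`, `Topic`, `depth`, and `transient_local` are hypothetical stand-ins chosen for this illustration, everything runs in one process, and the reliability setting (acknowledgment and retransmission) is not modeled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, List, Optional


@dataclass(frozen=True)
class QoSProfile:
    """Hypothetical, simplified QoS profile (not the rclpy or DDS type)."""
    depth: Optional[int] = 10      # int -> "keep-last" queue size; None -> "keep-all"
    transient_local: bool = False  # False -> "volatile" durability


class Subscription:
    """Holds received samples until the application consumes them."""

    def __init__(self, qos: QoSProfile) -> None:
        # A bounded deque drops its oldest entry when full, which models
        # "keep-last"; maxlen=None is unbounded, which models "keep-all".
        self.queue: Deque[Any] = deque(maxlen=qos.depth)

    def deliver(self, msg: Any) -> None:
        self.queue.append(msg)

    def take_all(self) -> List[Any]:
        msgs = list(self.queue)
        self.queue.clear()
        return msgs


class Topic:
    """In-process stand-in for a publisher and the topic it publishes on."""

    def __init__(self, qos: QoSProfile) -> None:
        self.qos = qos
        self.subscriptions: List[Subscription] = []
        # Samples retained for late joiners; used only with transient-local.
        self._retained: Deque[Any] = deque(maxlen=qos.depth)

    def publish(self, msg: Any) -> None:
        if self.qos.transient_local:
            self._retained.append(msg)
        for sub in self.subscriptions:
            sub.deliver(msg)

    def subscribe(self, qos: QoSProfile) -> Subscription:
        sub = Subscription(qos)
        if self.qos.transient_local:
            # Replay stored samples to the late-joining subscription.
            for msg in self._retained:
                sub.deliver(msg)
        self.subscriptions.append(sub)
        return sub


if __name__ == "__main__":
    latched = Topic(QoSProfile(depth=2, transient_local=True))
    for i in range(5):
        latched.publish(i)
    late = latched.subscribe(QoSProfile(depth=10))
    print(late.take_all())  # prints [3, 4]
```

In this sketch a volatile topic gives a late joiner nothing, whereas a transient-local topic replays as many samples as its own depth allows. A real middleware also negotiates compatibility between the publisher's and the subscription's profiles, which is omitted here.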
Performance and reliability
Experiments were conducted to benchmark the networking
can be found in https://github.com/ros2/performance_test and https://github.com/ros2/buildfarm_perf_tests.
The tests comprise one publisher and one subscription. For each message size, 1000 messages are sent per second, and the system records the latency, effective publication rate, and CPU utilization. The message sizes are selected to test different aspects, ranging from small to larger messages at key intervals. The test is repeated with the publisher and subscription in different processes, within the same process, and within the same process using intraprocess communication.
The data show that intraprocess communication is the most efficient, with 95th percentile latency below 1 ms for all sizes below 8 MB. Intraprocess communication is also the most reliable, meeting the sending rate for all sizes below 8 MB. It bypasses the middleware stack and delivers data by passing pointers from the publisher to the subscription. This improvement is particularly magnified when working with large messages, around 1 MB and larger, which are most often associated with images, point clouds, or other forms of high-resolution data. When using node composition, the data show a similar story: the 95th percentile latency is below 1 ms with no dropped messages for sizes below 8 MB.
Multiprocess communication allows the publisher and subscription to be on separate machines on the network. Expectedly, it also shows
security infrastructure easy. There are three main concepts in DDS security:
Authentication
This establishes the identity of a message or participant in the network. ROS 2 uses digital signatures, which are based on public key cryptography, for authentication. SROS2 includes command-line utilities for generating and storing these digital signatures.
Access control
This allows for fine-grained policies to be applied to the authenticated network participants. It allows a participant to only discover approved participants and communicate over preapproved network interfaces. SROS2 has command-line tools for generating these configurations.
Encryption
This ensures that third parties cannot eavesdrop or replay data into the network. Encryption is performed using Advanced Encryption Standard Galois/Counter Mode (AES-GCM) symmetric-key cryptography. The key material is derived from the shared secret obtained as part of authentication.
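The access control concept described above, in which only authenticated participants on approved lists and preapproved interfaces may communicate, can be sketched as a default-deny allow-list check. This is only an illustration under stated assumptions: `AccessPolicy`, `approved_peers`, `approved_interfaces`, and `permits` are hypothetical names, and the sketch does not reproduce the SROS2 tooling or the DDS permissions file format.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical allow-list policy (not the SROS2 or DDS permissions format)."""
    approved_peers: FrozenSet[str]
    approved_interfaces: FrozenSet[str]

    def permits(self, peer: str, interface: str, authenticated: bool) -> bool:
        # Default deny: a peer is accepted only if its identity was verified
        # AND it is on the approved list AND it uses an approved interface.
        return (
            authenticated
            and peer in self.approved_peers
            and interface in self.approved_interfaces
        )


if __name__ == "__main__":
    # Peer and interface names below are made up for the example.
    policy = AccessPolicy(
        approved_peers=frozenset({"camera_driver", "planner"}),
        approved_interfaces=frozenset({"eth0"}),
    )
    print(policy.permits("planner", "eth0", authenticated=True))   # prints True
    print(policy.permits("planner", "wlan0", authenticated=True))  # prints False
    print(policy.permits("intruder", "eth0", authenticated=True))  # prints False
```

The check runs only after authentication has established who the peer is, which mirrors the ordering in the text: identity first, then policy.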
missions to execute. Next, it assembles each task in the mission and activates the required capabilities for the particular mission, modeled as lifecycle nodes. Last, it executes the mission.
Most of Ghost’s software is implemented as both lifecycle and component nodes. The lifecycle nodes are used to dynamically activate and deactivate features depending on the current mission requirements, such as toggling between Global Positioning System–based and Visual Inertial Odometry (VIO)–based localization. They have dozens of unique capabilities readily available for different missions, which take up few background resources when idle. The component nodes are independent modules developed by multiple teams and combined at runtime. Ghost found that these strategies are important when collaborating with a large interdisciplinary team on a limited-compute platform.
The provided ROS 2 tools allowed Ghost to create a highly flexible and efficient autonomy system in only a few months. By contrast, the company estimates that it would have taken many years with multiple engineers to create a similar capability if starting from scratch, thereby helping support new custom user applications in the wild.
The COVID-19 pandemic
After the initial coronavirus disease 2019 (COVID-19) lockdowns, the robot software team doubled in effective size while having reduced access to crucial hardware. At the same time, they were preparing for a demonstration with the U.S. Air Force (USAF) only months away. Ultimately, the company was successful by pivoting their processes to use capabilities made available in ROS 2.
Before the pandemic, most of the development occurred using robots in their offices. When access to robots was abruptly stopped, Ghost had to switch development over to the ROS 2 simulator, Gazebo. A single engineer was able to create the custom Gazebo plugins and simulation files required to represent the quadruped. This simulation was used to develop the entirety of the USAF demonstration’s autonomy system. This new capability is still used long after they were able to return to their offices; it has permitted faster internal development to create custom behaviors and deploy them onto customers’ robots.
ROS 2 as an equalizer
ROS 2 is a strong equalizing force for Ghost Robotics. It has helped them compete effectively with well-funded and entrenched competitors. Rather than building an end-to-end proprietary portfolio of software, they leverage ROS 2’s capabilities where possible. According to Hunter Allen, “We have a competitive product because we have the tools needed to make a competitive product. We don’t have to waste time making what ROS 2 already does.” With only 23 employees,
Safe, automated testing
Flying drones poses inherent risks to people and things on the ground, as well as to the airframe itself. A great deal of labor and time is required to conduct safe flight testing because every physical flight has a risk of crashing. In simulation, however, the cost and risks associated with test flights are near zero. A failure in simulation can be fixed and iterated upon quickly and then rerun. Auterion uses the ROS 2 simulator, Gazebo, to conduct end-to-end tests of the software before hardware testing to validate safe functionality.
Gazebo is used in their continuous integration pipeline to prevent regressions on an array of vehicle types and scenarios. Tests are run in parallel for fast results, which allows developers to focus on a specific problem while remaining confident that the software is safe.
Auterion also leverages simulation testing to validate features in challenging scenarios during development. For example, they can set up flight regimes or specific situations that are important to validate their work. In 2021, Auterion flew approximately 22,000 hours within Gazebo, including high-risk scenarios impractical to test with the hardware. Auterion estimates that these simulations replaced 12 full-time engineers to provide the same value in live tests. The cost of their airframes ranges from $1000 to $100,000, so
Environment for Remote Virtual Exploration (VERVE), which allows operators to visualize the rover’s environment (45). The operators use the result to simulate a move and then finally execute the move on the rover.
Mission testing in simulation
Because VIPER is a spaceflight mission, the team is focused on producing highly reliable software. To achieve this, they are extensively using Gazebo to provide high-fidelity testing of all their components and systems. Mark Allan said that “having a simulator [Gazebo] is essential for the development of all the VIPER software in some capacity.”
The VIPER team turned to Gazebo to aid in development because it was infeasible to model an accurately functioning lunar rover on Earth. They emphasized that “the Lunar environment is so unique, with lighting and gravity, testing in simulation [is] incredibly important since it’s impossible to test on the ground on Earth effectively.” The project was able to create a simulation using custom plugins to Gazebo’s user interfaces. It is designed for a high degree of customization to support a broad range of robotics needs, even space.
NASA developed new plugins to model mission specifics, such as camera lens flare, lunar lighting conditions, gravity, and terrain
Fig. 5. VIPER rover in Lunar simulation with command and control software, VERVE. (A) VIPER on Lunar Surface (rendering), (B) Command and Operations Software.
and low reliability. The VIPER team evaluated the options and selected DDS as well for the Earth-based operations.
Besides a communications mechanism, the VIPER team was eager to use ROS 2 for its rapid development capabilities, introspection and visualization tools, and openly available source. These characteristics shorten the learning curve for new engineers to apply what they know to flight missions.
However, using new software in a flight mission requires a rigorous verification and validation (V&V) process. NASA prefers to use components that have been vetted in previous missions; leveraging heritage software leads to reduced development times and costs (47). VIPER is reusing 84% of the 588,000 lines of code from the Resource Prospector along with Gazebo and approximately 312 open-source ROS 2 packages (46). ROS 2 has not been used in prior missions, but the VIPER team decided that the features that it provides were worth the extra administrative overhead of going through the process. After ROS 2 has been validated and used in ground operations for the VIPER mission, it becomes much easier for ROS 2 to be used in future missions in multiple roles and allow for more reuse of robotic software between mission programs.
whole business might not have been feasible. It would have been too expensive.” He estimates that their continuing engineering costs would be 5 to 10% higher annually without it.
Acceleration of development
OTTO Motors’ development and deployment have been sped up in two additional areas. First, it has accelerated their internal feature development process. The distributed architecture and isolation of processes have allowed a large, physically distributed team to collaborate. Using clearly defined ROS 2 interfaces allowed OTTO to separate major classes of tasks. Ryan Gariepy stated in an interview, “at the scale of robots we’re building and the complexity that is modern manufacturing, you really need the flexibility to patch in and out capabilities and share across a large team.” Their product software is spread across many repositories owned by different teams in a diverse set of languages, combined at runtime via ROS 2.
Next, providing ROS 2 support has proven valuable to their customers and clients. OTTO and Clearpath sell their platforms to other businesses, which build custom products on top of them. A company recently bought platforms from OTTO to create ultraviolet-sanitizing robots in response to the COVID-19 pandemic. Because they have
proliferation of ROS expertise in the industry, matched with its freely available licensing, has made it the major robotics SDK. By using ROS 2 and its conventions, they are able to sell platforms that can be put to work in bespoke applications quickly.
It should be noted that these themes (software reuse, collaborations, and trusted platforms) are highly correlated with the design principles laid out in the “Design principles” section. In particular, they are in line with the design principles of Distribution, Abstraction, and Modularity. The adherence to those design principles has directly resulted in the emergent themes in our studies, which represent some of the largest acceleration factors for the robotics industry today.
CONCLUSION
ROS 2 has been redesigned from the ground up to meet the challenges of modern robotics. It was designed based on a thoughtful set of principles, modern robotics requirements, and support for extensive customization. Largely based on DDS, ROS 2 is a reliable and high-quality robotics framework that can support a broad range of applications. This framework continues to help accelerate the
14. M. Montemerlo, N. Roy, S. Thrun, Perspectives on standardization in mobile robot programming: The Carnegie Mellon Navigation (CARMEN) toolkit, in Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No. 03CH37453) (IEEE, 2003), vol. 3, pp. 2436–2441.
15. C. Mohan, R. Dievendorff, Recent work on distributed commit protocols, and recoverable messaging and queuing. Data Eng. 17, 1 (1994).
16. J. Waldo, The Jini architecture for network-centric computing. Commun. ACM 42, 76–82 (1999).
17. OASIS, MQTT Version 5.0: OASIS Standard (OASIS MQTT Technical Committee, 2019).
18. B. P. Gerkey, R. T. Vaughan, K. Støy, A. Howard, G. S. Sukhatme, M. J. Matarić, Most Valuable Player: A robot device server for distributed control, in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IEEE, 2001).
19. G. Metta, P. Fitzpatrick, L. Natale, YARP: Yet another robot platform. Int. J. Adv. Robot. Syst. 3, 8–48 (2006).
20. A. S. Huang, E. Olson, D. C. Moore, LCM: Lightweight Communications and Marshalling, in IEEE/RSJ International Conference on Intelligent Robots and Systems (IEEE, 2010), pp. 4057–4062.
21. H. Bruyninckx, P. Soetens, B. Koninckx, The real-time motion control core of the Orocos project, in Proceedings of the IEEE International Conference on Robotics and Automation (IEEE, 2003).
22. Apache Software Foundation, Apache License, Version 2.0 (2021); https://www.apache.org/licenses/LICENSE-2.0.html [accessed 3 September 2021].
23. K. Birman, T. A. Joseph, Exploiting virtual synchrony in distributed systems, in ACM Symposium on Operating Systems Principles (1987), pp. 123–138.
44. D. McComas, NASA/GSFC’s Flight Software Core Flight System, in Flight Software Workshop (Southwest Research Institute San Antonio, Texas, 2012).
45. S. Y. Lee, S. Lees, T. Cohen, M. Allan, M. Deans, T. Morse, E. Park, T. Smith, Reusable science tools for analog exploration missions: xgds web tools, verve, and gigapan voyage. Acta Astronaut. 90, 268–288 (2013).
46. S. Stukes, M. Allan, M. Deans, G. Bajjalieh, T. Fong, J. Hihn, H. Utz, An innovative approach to modeling viper rover software life cycle cost, in Proceedings of the 2021 IEEE Aerospace Conference (50100) (IEEE, 2021).
47. C. Price, Heritage Software Save up to 97% on future V&V for real projects (2021); www.nasa.gov/sites/default/files/03-09_ivv_guidance_for_ivv_for_product_line_software.pdf [accessed 7 September 2021].
Acknowledgments: We would like to thank the companies’ representatives interviewed in the case studies. These include H. Allen and J. Laney from Ghost Robotics, C. Cross from Mission Robotics, N. Marques and M. Achtelik from Auterion, M. Allan and T. Fong from NASA Ames, and R. Gariepy from OTTO Motors. We would also like to thank the team at Open Robotics, members of the ROS 2 Technical Steering Committee, and the community for their passionate support.
Submitted 4 October 2021
Accepted 13 April 2022
Published 11 May 2022
10.1126/scirobotics.abm6074