Manipulation
Russ Tedrake
Note: These are working notes used for a course being taught at MIT. They will be
updated throughout the Fall 2020 semester.
The PDF version of these notes is autogenerated from the HTML version. There
are a few conversion/formatting artifacts that are easy to fix (please feel free to point
them out). But there are also interactive elements in the HTML version that are not easy
to put into the PDF. When possible, I try to provide a link. But I consider the online
HTML version to be the main version.
TABLE OF CONTENTS
• Preface
• Chapter 1: Introduction
◦ Manipulation is more than pick-and-place
◦ Open-world manipulation
◦ Simulation
◦ These notes are interactive
◦ Model-based design and analysis
◦ Organization of these notes
• Chapter 2: Let's get you a robot
◦ Arms
▪ Position-controlled robots
▪ Torque-controlled robots
▪ A proliferation of hardware
▪ Simulating the Kuka iiwa
◦ Hands
▪ Dexterous hands
▪ Simple grippers
▪ Soft/underactuated hands
▪ Other end effectors
▪ If you haven't seen it...
◦ Sensors
◦ Putting it all together
◦ Exercises
• Chapter 3: Basic Pick and Place
◦ Monogram Notation
◦ Pick and place via spatial transforms
◦ Spatial Algebra
◦ Forward kinematics
▪ The kinematic tree
▪ Forward kinematics for pick and place
◦ Differential kinematics (Jacobians)
◦ Differential inverse kinematics
▪ The Jacobian pseudo-inverse
▪ Invertibility of the Jacobian
◦ Defining the grasp and pre-grasp poses
◦ A pick and place trajectory
◦ Putting it all together
◦ Differential inverse kinematics with constraints
▪ Pseudo-inverse as an optimization
▪ Adding velocity constraints
▪ Adding position and acceleration constraints
▪ Joint centering
▪ Alternative formulations
◦ Exercises
• Chapter 4: Geometric Pose Estimation
◦ Cameras and depth sensors
▪ Depth sensors
▪ Simulation
◦ Representations for geometry
◦ Point cloud registration with known correspondences
◦ Iterative Closest Point (ICP)
◦ Dealing with partial views and outliers
▪ Detecting outliers
▪ Point cloud segmentation
▪ Generalizing correspondence
▪ Soft correspondences
▪ Nonlinear optimization
▪ Global optimization
◦ Non-penetration and "free-space" constraints
▪ Free space constraints as non-penetration constraints
◦ Putting it all together
◦ Looking ahead
◦ Exercises
• Chapter 5: Bin Picking
◦ Generating random cluttered scenes
▪ Falling things
▪ Static equilibrium with frictional contact
▪ A few of the nuances of simulating contact
▪ Static equilibrium as an optimization constraint
◦ Grasp selection
▪ Point cloud pre-processing
▪ Estimating normals and local curvature
▪ Evaluating a candidate grasp
▪ Generating grasp candidates
◦ The corner cases
◦ Putting it all together
▪ A simple state machine
◦ Exercises
• Chapter 6: Object detection and segmentation
◦ Getting to big data
▪ Crowd-sourced annotation datasets
▪ Segmenting new classes via fine tuning
▪ Annotation tools for manipulation
▪ Synthetic datasets
◦ Object detection and segmentation
◦ Putting it all together
◦ Exercises
• Chapter 7: Force Control
◦ A simple model
▪ (Direct) force control
▪ A force-based "flip-up" strategy
▪ Indirect force control
▪ A stiffness-control-based "flip-up" strategy
▪ Hybrid position/force control
◦ Peg in hole
◦ Manipulator control
▪ Joint stiffness control
▪ Cartesian stiffness control
▪ Some implementation details on the iiwa
◦ Putting it all together
◦ Exercises
• Chapter 8: Motion Planning
◦ Inverse Kinematics
▪ Closed-form solutions
▪ IK as constrained optimization
▪ Grasp planning as IK
◦ Kinematic trajectory optimization
◦ Sample-based motion planning
◦ Exercises
• Chapter 9: Reinforcement Learning
◦ Exercises
APPENDIX
• Appendix A: Drake
◦ Pydrake
◦ Online Jupyter notebooks
◦ Running on your own machine
▪ Install Drake
◦ Getting help
• Appendix B: Setting up your own "Manipulation Station"
◦ Message Passing
◦ Kuka LBR iiwa + Schunk WSG Gripper
◦ Intel Realsense D415 Depth Cameras
◦ Miscellaneous hardware.
• Appendix C: Miscellaneous
◦ How to cite these notes
◦ Annotation tool etiquette
◦ Please give me feedback!
PREFACE
I've always loved robots, but it's only relatively recently that I've turned my attention
to robotic manipulation. I particularly like the challenge of building robots that can
master physics to achieve human/animal-like dexterity and agility. It was passive
dynamic walkers and the beautiful analysis that accompanies them that first helped
cement this centrality of dynamics in my view of the world and my approach to
robotics. From there I became fascinated with (experimental) fluid dynamics, and
the idea that birds with articulated wings actually "manipulate" the air to achieve
incredible efficiency and agility. Humanoid robots and fast-flying aerial vehicles in
clutter forced me to start thinking more deeply about the role of perception in
dynamics and control. Now I believe that this interplay between perception and
dynamics is truly fundamental, and I am passionate about the observation that
relatively "simple" problems in manipulation (how do I button up my dress shirt?)
expose the problem beautifully.
In the past few years, deep learning has had an unquestionable impact on robotic
perception, unblocking some of the most daunting challenges in performing
manipulation outside of a laboratory or factory environment. We will discuss relevant
tools from deep learning for object recognition, segmentation, pose/keypoint
estimation, shape completion, etc. Now relatively old approaches to learning control
are also enjoying an incredible surge in popularity, fueled in part by massive
computing power and increasingly available robot hardware and simulators. Unlike
learning for perception, learning control algorithms are still far from a technology,
with some of the most impressive looking results still being hard to understand and
to reproduce. But the recent work in this area has unquestionably highlighted the
pitfalls of the conservatism taken by the controls community. Learning researchers
are boldly formulating much more aggressive and exciting problems for robotic
manipulation than we have seen before -- in many cases we are realizing that
some manipulation tasks are actually quite easy, but in other cases we are finding
problems that are still fundamentally hard.
Finally, it feels that the time is ripe for robotic manipulation to have a real and
dramatic impact in the world, in fields from logistics to home robots. Over the
last few years, we've seen UAVs/drones transition from academic curiosities into
consumer products. Even more recently, autonomous driving has transitioned from
academic research to industry, at least in terms of dollars invested. Manipulation
feels like the next big thing that will make the leap from robotic research to practice.
It's still a bit risky for a venture capitalist to invest in, but nobody doubts the size of
the market once we have the technology. How lucky are we to potentially be able to
play a role in that transition?
A MANIPULATION TOOLBOX
If you would like to replicate any or all of the hardware that we use for these notes,
you can find information and instructions in the appendix.
As you use the code, please consider contributing back (especially to the mature
code in DRAKE). Even questions/bug reports can be important contributions. If you
have questions/find issues with these notes, please submit them here.
CHAPTER 1
Introduction
It's worth taking time to appreciate just how amazingly well we are able to perform
tasks with our hands. Tasks that often feel mundane to us -- loading the dishwasher,
chopping vegetables, folding laundry -- remain as incredibly challenging tasks for
robots and are at the very forefront of robotics research.
Consider the problem of picking up a single plate from a stack of plates in the sink
and placing it into the dishwasher. Clearly you first have to perceive that there is
a plate in the sink and that it is accessible. Getting your hand to the plate likely
requires navigating your hand around the geometry of the sink and other dishes.
The act of actually picking it up might require a fairly subtle maneuver where you
have to tip up the plate, sliding it along your fingers and then along the sink/dishes
in order to get a reasonable grasp on it. Presumably as you lift it out of the sink,
you'd like to mostly avoid collisions between the plate and the sink, which suggests
a reasonable understanding of the size/extent of the plate (though I actually think
robots today are too afraid of collisions). Even placing the plate into the dishwasher
is pretty subtle. You might think that you would align the plate with the slats and
then slide it in, but I think humans are more clever than that. A seemingly better
strategy is to loosen your grip on the plate, come in at an angle and intentionally
contact one side of the slat, letting the plate effectively rotate itself into position as
you set it down. But the justification for this strategy is subtle -- it is a way to achieve
the kinematically accurate task without requiring much kinematic accuracy on the
position/orientation of the plate.
Figure 1.2 - A robot picking up a plate from a potentially cluttered sink (left: in
simulation, right: in reality). This example is from the manipulation team at the
Toyota Research Institute.
Perhaps one of the reasons that these problems remain so hard is that they require
strong capabilities in numerous technical areas that have traditionally been
somewhat disparate; it's challenging to be an expert in all of them. More so than
robotic mapping and navigation, or legged locomotion, or other great areas in
robotics, the most interesting problems in manipulation require significant
interactions between perception, planning, and control. This includes both
geometric perception to understand the local geometry of the objects and
environment and semantic perception to understand what opportunities for
manipulation are available in the scene. Planning typically includes reasoning about
the kinematic and dynamic constraints of the task (how do I command my rigid
seven degree-of-freedom arm to reach into the drawer?). But it also includes higher-
level task planning (to get milk into my cup, I need to open the fridge, then grab
the bottle, then unscrew the lid, then pour... and perhaps even put it all back
when I'm done). The low-level begs for representations using real numbers, but
the higher levels feel more logical and discrete. At perhaps the lowest level, our
hands are making and breaking contact with the world either directly or through
tools, exerting forces, rapidly and easily transitioning between sticking and sliding
frictional regimes -- these alone are incredibly rich and difficult problems from the
perspective of dynamics and control.
There are a large number of applications for manipulation. Picking up an object from
one bin and placing it into another bin -- one version of the famous "pick and place"
problem -- is a great application for robotic manipulation. Robots have done this for
decades in factories with carefully curated parts. In the last few years, we've seen
a new generation of pick-and-place systems that use deep learning for perception,
and can work well with much more diverse sets of objects, especially if the location/
orientation of the placement need not be very accurate. This can be done with
conventional robot hands or more special-purpose end-effectors that, for instance,
use suction. It can often be done without having a very accurate understanding of
the shape, pose, mass, or friction of the object(s) to be picked.
The goal for these notes, however, is to examine the much broader view of
manipulation than what is captured by the pick and place problem. Even our thought
experiment of loading the dishwasher -- arguably a more advanced type of pick and
place -- requires much more from the perception, planning, and control systems.
But the diversity of tasks that humans (and hopefully soon robots) can do with their
hands is truly remarkable. To see one small catalog of examples that I like, take a
look at the Southampton Hand Assessment Procedure (SHAP), which was designed
as a way to empirically evaluate prosthetic hands. Matt Mason also gave a broad and
thoughtful definition of manipulation in the opening of his 2018 review paper[1].
It's also important to recognize that manipulation research today looks very different
than it did in the 1980s and 1990s. During that time
there was a strong and beautiful focus on "manipulation as grasping," with seminal
work on, for instance, the kinematics/dynamics of multi-fingered hands assuming a
stable grasp on an object. Sometimes still, I hear people use the term "grasping"
as almost synonymous with manipulation. But please realize that the goals of
manipulation research today, and of these notes, are much broader than that. Is
grasping a sufficient description of what your hand is doing when you are buttoning
your shirt? Making a salad? Spreading peanut butter on toast?
The idea that the world has infinite variability (there will never be a point at which
you have seen every possible kitchen) is often referred to as the "open-world" or
"open-domain" problem -- a term popularized first in the context of video games.
It can be tough to strike a balance between rigorous thinking about aspects of the
manipulation problem, and trying to embrace the diversity and complexity of the
entire world. But it's important to walk that line.
There is a chance that the diversity of manipulation in the open world might actually
make the problem easier. We are just at the dawn of the era of big data in robotics;
the impact this will have cannot be overstated. But I think it is deeper than that.
As an example, some of our optimization formulations for planning and control
might get stuck in local minima now because narrow problem formulations can have
many quirky solutions; but once we ask a controller to work in a huge diversity of
scenarios then the quirky solutions can be discarded and the optimization landscape
may become much simpler. But to the extent that is true, then we should aim to
understand it rigorously!
1.3 SIMULATION
There is another reason that we are now entering a golden age for manipulation
research. Our software tools have (very recently) gotten much better!
It was only about three years ago that I remember talking to my students, who
were all quite adept at using simulation for developing control systems for walking
robots, about using simulation for manipulation. "You can't work on manipulation in
simulation" was their refrain, and for good reason. The complexity of the contact
mechanics in manipulation has traditionally been much harder to simulate than a
walking robot that only interacts with the ground and through a minimal set of
contact points. Looming even larger, though, was the centrality of perception for
manipulation; it was generally accepted that one could not simulate a camera well
enough to be meaningful.
How quickly things can change! The last few years have seen a rapid adoption of
video-game quality rendering by the robotics and computer vision communities.
The growing consensus now is that game-engine renderers can model cameras
well enough not only to test a perception system in simulation, but even to train
perception systems in simulation and expect them to work in the real world! This is
fairly amazing, as we were all very concerned before that training a deep learning
perception system in simulation would allow it to exploit any quirks of the simulated
images that could make the problem easier.
We have also seen dramatic improvements in the quality and performance of contact
simulation. Making robust and performant simulations of multi-body contact
involves dealing with complex geometry queries and stiff (measure-) differential
equations. There is still room for fundamental improvements in the mathematical
formulations of the governing equations, but today's solvers are good enough to be
extremely useful.
By leveraging the power of simulation, and the new availability of free online
interactive compute, I am trying to make these notes more than what you would find
in a traditional text. Each chapter will have working code examples, which you can
run immediately (no installation required) on Google's Colaboratory, or download/
install and run on your local machine (see the appendix for more details). It uses an
open-source library called DRAKE which has been a labor of love for me over the last
7 years or so. Because all of the code is open source, it is entirely up to you how
deep into the rabbit hole you choose to go.
Many of you will be familiar with ROS (the Robot Operating System). I believe that
ROS was one of the best things to happen to robotics in the last decade. It meant that
experts from different subdisciplines could easily share their expertise in the form of
modular components. Components (as ROS packages) simply agree on the messages
that they will send and receive on the network; packages can inter-operate even if
they are written in different programming languages or even on different operating
systems.
Although ROS makes it relatively easy to get started with manipulation, it doesn't
serve my pedagogical goal of thinking clearly about manipulation. The modular
approach to authoring the components is extremely good, and we will adopt it here.
But in DRAKE we ask for a little bit more from each of the components -- essentially
that they declare their states, parameters, and timing semantics in a consistent way
-- so that we have a much better chance of understanding the complex relationships
between systems. This has great practical value as well; the ability to debug a full
manipulation stack with repeatable deterministic simulations (even if they include
randomness) is surprisingly rare in the field but hugely valuable.
The key building block in our work will be DRAKE Systems, and systems can be
combined in complex combinations into Diagrams. System diagrams have long
been the modeling paradigm used in controls, and the software aspect of it will be
very familiar to you if you've used tools like Simulink, LabView, or Modelica. These
software tools refer to the block-diagram design paradigm as "model-based design".
This graphic is interactive. Make sure you zoom in and click around to get a sense
for the amount of complexity that we can abstract away in this framework. For
instance, try expanding the iiwa_controller block.
Whenever you are ready to learn more about the Systems Framework in DRAKE, I
recommend starting with the "Modeling Dynamical Systems" tutorial linked from
the main Drake website.
Let me be transparent: not everybody likes this systems framework. Some people
are just trying to move fast, and don't see the benefits of slowing down to declare
their state variables, etc. It will likely feel like a burden to you the first time you
go to write a new system. But it's precisely because we're trying to build complex
systems quickly that I advocate this more rigorous approach. I believe that getting
to the next level of maturity in our open-world manipulation system requires more
maturity from our building blocks.
The remaining chapters of these notes are organized around the component-level
building blocks of manipulation. Many of these components each individually build
on a wealth of literature (e.g. from computer vision, or dynamics and control).
Rather than be overwhelmed, I've chosen to focus on delivering a consistent
coherent presentation of the most relevant ideas from each field as they relate to
manipulation, and pointers to more literature. Even finding a single notation across
all of the fields can be a challenge!
The next few chapters will give you a minimal background on the relevant robot
hardware that we are simulating, on (some of) the details about simulating them,
and on the geometry and kinematics basics that we will use heavily through the
notes.
For the remainder of the notes, rather than organize the chapters into "Perception",
"Planning", and "Control", I've decided to spiral through these topics. In the first
part, we'll do just enough perception, planning, and control to build up a basic
manipulation system that can perform pick-and-place manipulation of known
isolated objects. Then I'll try to motivate each new chapter by describing what our
previous system cannot do, and what we hope it will do by the end of the chapter.
CHAPTER 2
Let's get you a robot
Although we are going to focus on one particular set of hardware for these notes,
one of the great things about modern robotics is that many of the tools we will
develop over the course of these notes are quite general, and can be transferred
from one robot to another easily. I could imagine a future version of these notes
where you really do get to build out your robot in this chapter, and use your
customized robot for the remaining chapters!
2.1 ARMS
There appear to be a lot of robot arms available on the market. So how does
one choose? Cost, reliability, usability, payload, range of motion, ...; there are many
important considerations. And the choices we might make for a research lab might
be very different than the choices we might make as a startup.
Swapping robots in and out is made possible by the proliferation of some common file formats for
describing our robots. Unfortunately, the field has not converged on a single
preferred format (yet), and each of them has its quirks. Drake currently loads
Unified Robot Description Format (URDF) and Simulation Description Format
(SDF); the Drake developers have been trying to upstream improvements to SDF
rather than start yet another format.
There is one particular requirement which, if we insist on it, quickly winnows the
field to only a few viable platforms: that requirement is joint-torque
sensing and control. It's not absolutely clear that this feature is required, even for
very advanced applications, but as a researcher who cares a great deal about the
contact interactions between my robots and the world, I prefer to have the capability
and explore whether I need it rather than wonder what life might have been like.
Figure 2.1 - Kuka LBR iiwa robot. This one has a 7kg payload.
Out of the torque-controlled robots on the market, I've chosen the Kuka LBR
iiwa robot to use throughout these notes (I will try to use the lower case "iiwa"
to be consistent with the manufacturer, but it looks wrong to me every time!). To
better understand why, let us start by understanding the difference between most
robots, which are position-controlled, and the handful of robots that have accepted
the additional cost and complexity to provide torque sensing and control.
Figure 2.2 - Two popular position-controlled manipulators. (Left) The UR10 from
Universal Robots. (Right) The ABB YuMi.
Most robot arms today are "position controlled" -- given a desired joint position
(or joint trajectory), the robot executes it with relatively high precision. Basically
all arms can be position controlled -- if the robot offers a torque control interface
(with sufficiently high bandwidth) then we can certainly regulate positions, too. In
practice, calling a robot "position controlled" is a polite way of saying that it does
not offer torque control. Do you know why position control and not torque control is
the norm?
Lightweight arms like the examples above are actuated with electric motors. For a
reasonably high-quality electric motor (with windings designed to minimize torque
ripple, etc), we expect the torque that the motor outputs to be directly proportional
to the current that we apply:
$$\tau_{motor} = k_t i,$$
where $\tau_{motor}$ is the motor torque, $i$ is the applied current, and $k_t$ is the "motor
torque constant". (Similarly, applied voltage has a simple (affine) relationship with
the motor's steady-state velocity). If we can control the current, then why can we not
control the torque?
The short answer is that to achieve reasonable cost and weight, we typically choose
small electric motors with large gear reductions, and gear reductions come with
a number of dynamic effects that are very difficult to model -- including backlash,
vibration, and friction. So the simple relationship between current and torque breaks
down. Conventional wisdom is that for large gear ratios (say ≫ 10), the unmodeled
terms are significant enough that they cannot be ignored, and torque is no longer
simply related to current.
Position Control.
How can we overcome this challenge of not having a good model of the transmission
dynamics? Regulating the current or speed of the motor only requires sensors on
the motor side of the transmission; if we add more sensors on the output side of
the transmission, then we can use high-gain feedback to try to directly regulate the
joint, instead of the motor. Importantly, although the torques due to the transmission
are not known precisely, they are also not arbitrary -- for instance they will never
add energy into the system. Most importantly, we can be confident that there is
a monotonically increasing relationship between the current that we put into the
motor and the torque at the joint, and ultimately the acceleration of the joint.
Note that I chose the term monotonic carefully, meaning "non-decreasing" but not
implying "strictly increasing", because, for instance, when a joint is starting from
rest, static friction will resist small torques without having any acceleration at the
output.
The most common sensor to add to the joint is a position sensor -- typically an
encoder or potentiometer -- these are inexpensive, accurate, and robust. In practice,
we think of these as providing (after some signal filtering/conditioning) accurate
measurements of the joint position and joint velocity via differentiation -- joint
accelerations can also be obtained via differentiating twice but are generally
considered more noisy and less suitable for use in tight feedback loops. This is
sufficient for accurately tracking desired position trajectories of the arm. For each
joint, if we are given a desired trajectory $q_d(t)$, then I can track this using
proportional-integral-derivative (PID) control:
$$\tau = k_p (q_d - q) + k_d (\dot{q}_d - \dot{q}) + k_i \int (q_d - q)\, dt,$$
with $k_p$, $k_d$, and $k_i$ being the position, velocity, and integral gains. PID control has a
rich theory, and a trove of knowledge about how to choose the gains, which I will
not reproduce here. I will note, however, that when we simulate position-controlled
robots we use the equation above, commanding a torque directly, but on hardware
our PID controller typically outputs voltage commands (via pulse-width modulation).
Closing this modeling gap has traditionally not been a priority for robot simulation,
but does mean that we often have one list of gains for the physical robot and a
separate list for the simulation.
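To make the control law above concrete, here is a minimal numerical sketch of PID position control of a single joint, commanding a torque directly as we do in simulation. It is plain Python/NumPy, not Drake-specific, and the inertia, gains, and setpoint are made-up illustrative values.

import numpy as np

# Illustrative single-joint simulation with PID position control.
I, dt = 1.0, 1e-3                      # joint inertia [kg m^2], timestep [s]
kp, kd, ki = 100.0, 20.0, 10.0         # PID gains (made up)
q, qdot, integral = 0.0, 0.0, 0.0      # initial state
q_d, qdot_d = 1.0, 0.0                 # desired (constant) position and velocity

for _ in range(5000):
    error = q_d - q
    integral += error * dt
    tau = kp * error + kd * (qdot_d - qdot) + ki * integral  # the PID law above
    qddot = tau / I                    # f = ma (no gravity in this sketch)
    qdot += qddot * dt
    q += qdot * dt

print(q)  # should have converged close to q_d = 1.0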
Electric motors are most efficient at high speeds (often > 100 or 1,000 rotations
per minute). We probably don't actually want our robots to move that fast even if
they could! So nearly all electric robots have fairly substantial gear reductions, often
on the order of 100:1; the transmission output turns one revolution for every 100
rotations of the motor, and the output torque is 100 times greater than the motor
torque. For a gear ratio, $n$, actuating a joint $q$, we have
$$q_{motor} = n q, \quad \dot{q}_{motor} = n \dot{q}, \quad \ddot{q}_{motor} = n \ddot{q}, \quad \tau_{motor} = \frac{1}{n}\tau.$$
Interestingly, this has a fairly profound impact on the resulting dynamics (given
by $f = ma$), even for a single joint. Writing the relationship between joint torque
and joint acceleration (no motors yet), we can write $ma = \sum f$ in the rotational
coordinates as
$$I_{arm} \ddot{q} = \tau_{gravity} + \tau,$$
where $I_{arm}$ is the moment of inertia. For example, for a simple pendulum of mass $m$
and length $l$, we might have $I_{arm} = ml^2$ and $\tau_{gravity} = -mgl\sin q$.
But the applied joint torque $\tau$ actually comes from the motor -- if we write this
equation in terms of motor coordinates we get:
$$\frac{I_{arm}}{n} \ddot{q}_{motor} = \tau_{gravity} + n \tau_{motor}.$$
If we divide through by $n$, and take into account the fact that the motor itself has
inertia (e.g. from the large spinning magnets) that is not affected by the gear ratio,
then we obtain:
$$\left(I_{motor} + \frac{I_{arm}}{n^2}\right) \ddot{q}_{motor} = \frac{\tau_{gravity}}{n} + \tau_{motor}.$$
It's interesting to note that, even though the mass of the motors might make up only
a small fraction of the total mass of the robot, for highly geared robots they can play
a significant role in the dynamics of the joints. We use the term reflected inertia to
denote the inertial load that is felt on the opposite side of a transmission, due to
the scaling effect of the transmission. The "reflected inertia" of the arm at the motor
is cut by the square of the gear ratio; or the "reflected inertia" of the motor at the
arm is multiplied by the square of the gear ratio. This has interesting consequences
-- as we move to the multi-link case, we will see that the inertia of the arm is a
state-dependent function that captures the inertial coupling of the other joints in the
manipulator. The motor inertia, on the other hand, is constant. For large gear ratios,
this means the dynamics of the joint are less dependent on the nonlinear effects of
joint position and velocity, making it relatively easy to tune constant feedback gains
that perform well in all configurations.
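To see how dominant the motor inertia can become, here is a quick back-of-the-envelope calculation; the numbers are purely illustrative, not specifications of any particular robot.

# Reflected inertia example with made-up numbers.
I_motor = 1e-4   # rotor inertia [kg m^2]
I_arm = 1.0      # link inertia seen at the joint [kg m^2]
n = 100          # gear ratio

# Total inertia seen at the motor side of the transmission:
I_at_motor = I_motor + I_arm / n**2     # = 2.0e-4 kg m^2
# Equivalently, the total inertia seen at the joint side:
I_at_joint = n**2 * I_motor + I_arm     # = 2.0 kg m^2

# Even though the rotor inertia is 10,000x smaller than the link inertia,
# after the 100:1 gearbox it contributes as much as the entire arm.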
Although not as common, there are a number of robots that do support direct control
of the joint torques. There are a handful of ways that this capability can be realized.
It is possible to actuate a robot using electric motors that require only a small gear
reduction (e.g. ≤ 10:1) where the frictional forces are negligible. In the past, these
"direct-drive robots"[1] had enormous motors and limited payloads. More recently,
robots like the Barrett WAM arm used cable drives to keep the arm light but having
large motors in the base. And just in the last few years, we've seen progress in high-
torque outrunner and frameless motors bringing in a new generation of low-cost,
"quasi-direct-drive" robots: e.g. MIT Cheetah [2], Berkeley Blue, and Halodi Eve.
Hydraulic actuators provide another solution for generating large torques without
large transmissions. Sarcos had a series of torque-controlled arms (and humanoids),
and many of the most famous robots from Boston Dynamics are based on hydraulics
(though there is an increasing trend towards electric motors). These robots typically
have a single central pump and each actuator has a (lightweight) valve that can
shunt fluid through the actuator or through a bypass; the differential pressure across
the actuator is at least approximately related to the resulting force/torque.
Another approach to torque control is to keep the large gear-ratio motors, but add
sensors to directly measure the torque at the joint side of the actuator. This is
the approach used by the Kuka iiwa robot that we use in the example throughout
this text; the iiwa actuators have strain gauges integrated into the transmission.
However there is a trade-off between the stiffness of the transmission and the
accuracy of the force/torque measurement [3] -- the iiwa transmission includes an
explicit "Flex Spline" with a stiffness around 5000 Nm/rad [4]. Taking this idea
to an extreme, Gill Pratt proposed "series-elastic actuators" that have even lower
stiffness springs in the transmission, and proposed measuring joint position on
both the motor and joint sides of the transmission to estimate the applied torques
[5]. For example, the Baxter and Sawyer robots from Rethink use series elastic
actuators; I don't think they've published the spring stiffness values but similarly-
motivated series-elastic actuators from HEBI robotics are closer to 100 Nm/rad.
Even for the iiwa actuators, the joint elasticity is significant enough that the low-
level controllers go to great lengths to take it into account explicitly in order to
achieve high-performance control of the joints[6]. We will discuss these details when
we get to the chapter covering force control.
The low-cost torque-controlled arms that I mentioned above are just the beginnings
in what promises to be a massive proliferation of robotic arms. As demand increases,
costs will continue to come down.
Let me just say that, compared to working on legged robots, where for decades
we did our research on laboratory prototypes built by graduate students (and
occasionally professors!) in the machine shop down the hall, the availability of
professionally engineered, high-quality, high-uptime hardware is an absolute treat.
This also means that we can test algorithms in one lab and have another lab perhaps
at another university testing algorithms on almost identical hardware; this facilitates
levels of repeatability and sharing that were impossible before. The fact that the
prices are coming down, which will mean many more similar robots in many more
labs/environments, is one of the big reasons why I am so optimistic about the next
few years in the field.
It's time to simulate our chosen robotic arm. The first step is to obtain a robot
description file (URDF or SDF). For convenience, we ship the models for a few
robots, including iiwa, with Drake. If you're interested in simulating a different
robot, you can find either a URDF or SDF describing most commercial robots
somewhere online. But a word of warning: the quality of these models can vary
wildly. We've seen surprising errors in even the kinematics (link lengths, geometries,
etc), but the dynamics properties (inertia, friction, etc) in particular are often not
accurate at all. Sometimes they are not even mathematically consistent (e.g. it is
possible to specify an inertial matrix which is not physically realizable by any rigid
body). Drake will complain if you ask it to load a file with this sort of violation; we
would rather alert you early than start generating bogus simulations. There is also
increasingly good support for exporting to a robot format directly from CAD software
like Solidworks.
Now we have to import this robot description file into our physics engine. In Drake,
the physics engine is called MultibodyPlant. The term "plant" may seem odd -- it
is the word used in the controls literature to represent the physical system, with
origins in the control of chemical plants -- but it is pervasive. This connection is
also important to me; not many physics engines in the world go to the lengths that
Drake does to make the physics engine compatible with system-theoretic design and
analysis. You can access the methods of MultibodyPlant through its class methods,
e.g. for use in designing a model-based controller. But in order to work with our
model-based design philosophy, MultibodyPlant also provides an interface as a
System, with input and output ports.
The MultibodyPlant system has input ports applied_spatial_force, model_instance_name[i]_actuation, and geometry_query, and output ports reaction_forces, contact_results, model_instance_name[i]_continuous_state, model_instance_name[i]_generalized_acceleration, model_instance_name[i]_generalized_contact_force, and geometry_pose.
As you might expect for something as complex and general as a physics engine,
it has many input and output ports; most of them are optional. I'll illustrate the
mechanics of using these in the following example.
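Since the Colab notebook cannot be reproduced here, the following is a rough sketch of what such an example looks like in pydrake. Treat it as a sketch: the resource path for the iiwa model file is my assumption and may differ in your Drake version, and the timestep is arbitrary.

import numpy as np
from pydrake.common import FindResourceOrThrow
from pydrake.multibody.parsing import Parser
from pydrake.multibody.plant import AddMultibodyPlantSceneGraph
from pydrake.systems.framework import DiagramBuilder

builder = DiagramBuilder()
# Add the physics engine (MultibodyPlant) and the geometry engine (SceneGraph).
plant, scene_graph = AddMultibodyPlantSceneGraph(builder, time_step=1e-4)
# Load an iiwa model that ships with Drake (path is an assumption; check your install).
iiwa = Parser(plant).AddModelFromFile(FindResourceOrThrow(
    "drake/manipulation/models/iiwa_description/sdf/iiwa14_no_collision.sdf"))
# Weld the base to the world so the arm doesn't float freely.
plant.WeldFrames(plant.world_frame(), plant.GetFrameByName("iiwa_link_0"))
plant.Finalize()

diagram = builder.Build()
context = diagram.CreateDefaultContext()
plant_context = plant.GetMyContextFromRoot(context)
# The actuation input port must be connected or fixed before simulating;
# here we fix it to zero torques (and, as described below, the arm will fall).
plant.get_actuation_input_port(iiwa).FixValue(plant_context, np.zeros(7))

To actually advance time you would wrap the diagram in a Simulator (pydrake.systems.analysis.Simulator); the notebook versions also connect a visualizer.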
The best way to visualize the results of a physics engine is with a 2D or 3D visualizer.
For that, we need to add the system which curates the geometry of a scene; in Drake
we call it the SceneGraph. Once we have a SceneGraph, then there are a number of
different visualizers and sensors that we can add to the system to actually render
the scene.
The SceneGraph system has input ports source_pose{0} through source_pose{N-1} (one per geometry source) and output ports lcm_visualization and query.
You might wonder why MultibodyPlant doesn't handle the geometry of the scene
as well. Well, there are many applications where we'd like to render complex scenes,
and use complex sensors, but supply custom dynamics instead of the default physics
engine. Autonomous driving is a great example; in that case we want to populate a
SceneGraph with all of the geometry of the vehicles and environment, but we often
want to simulate the vehicles with very simple vehicle models that stop well short of
adding tire mechanics into our physics engine.
We now have a basic simulation of the iiwa, but already some subtleties emerge. The
physics engine needs to be told what torques to apply at the joints. In our example,
we apply zero torque, and the robot falls down. In reality, that never happens; in
fact there is essentially never a situation where the physical iiwa robot experiences
zero torque at the joints, even when the controller is turned off. Like many mature
industrial robot arms, iiwa has mechanical brakes at each joint that are engaged
whenever the controller is turned off. To simulate the robot with the controller
turned off, we would need to tell our physics engine about the torques produced by
these brakes.
In fact, even when the controller is turned on, and despite the fact that it is a
torque-controlled robot, we can never actually send zero torques to the motors. The
iiwa software interface accepts "feed-forward torque" commands, but it will always
add these as additional torques to its low-level controller which is compensating
for gravity and the motor/transmission mechanics. This often feels frustrating, but
probably we don't actually want to get into the details of simulating the drive
mechanics.
As a result, the simplest reasonable simulation we can provide of the iiwa must
include a simulation of its low-level controller. We will use the iiwa's "joint
impedance control" mode, and will describe the details of that once they become
important for getting the robot to perform better. For now, we can treat it as given,
and produce our simplest reasonable iiwa simulation.
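As a stand-in for the notebook, here is a hedged sketch of the idea: wire a pydrake InverseDynamicsController (a PID controller with inverse-dynamics compensation, not Kuka's actual impedance controller) between the plant's state output and its actuation input. It assumes the builder, plant, and iiwa variables from the earlier sketch (before builder.Build(), and without fixing the actuation port to zero), and the gains are made-up values.

import numpy as np
from pydrake.systems.controllers import InverseDynamicsController

# Made-up gains for the 7 iiwa joints.
kp = np.full(7, 100.0)
ki = np.full(7, 1.0)
kd = 2 * np.sqrt(kp)

# The controller needs a model of the robot; passing `plant` works here only
# because this plant contains nothing but the welded iiwa.
controller = builder.AddSystem(InverseDynamicsController(plant, kp, ki, kd, False))
builder.Connect(plant.get_state_output_port(iiwa),
                controller.get_input_port_estimated_state())
builder.Connect(controller.get_output_port_control(),
                plant.get_actuation_input_port(iiwa))
# The desired-state input (7 positions followed by 7 velocities) still needs to
# be fixed or connected, e.g. to hold the arm at a nominal configuration.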
As a final note, you might think that simulating the dynamics of the robot is overkill,
if our only goal is to simulate manipulation tasks where the robot is moving only
relatively slowly, and effects of mass, inertia and forces might be less important than
just the positions that the robot (and the objects) occupy in space. I would actually
agree with you. But it's surprisingly tricky to get a kinematic simulation to respect
the basic rules of interaction; e.g. to know when the object gets picked up or when
it does not (see, for instance [7]). Currently, in Drake, we mostly use the full physics
engine for simulation, but often use simpler models for manipulation planning and
control.
2.2 HANDS
You might have noticed that the iiwa model does not actually have a hand attached;
the robot ships with a mounting plate so that you can attach the "end-effector" of
your choice (and some options on access ports so you can connect your end-effector
to the computer without wires running down the outside of the robot). So now we
have another decision to make: what hand should we use?
It is interesting that, when it comes to robot end effectors, researchers in
manipulation tend to partition themselves into a few distinct camps.
Figure 2.3 - Dexterous hands. Left: the Shadow Dexterous Hand. Right: the Allegro
Hand.
Of course, our fascination with the human hand is well placed, and we dream of
building robotic hands that are as dexterous and sensor-rich. But the reality is that
we aren't there yet. Some people choose to pursue this dream and work with the
best dexterous hands on the market, and struggle with the complexity and lack of
robustness that ensues. The famous "learning dexterity" project from OpenAI used
the Shadow hand for playing with a Rubik's cube, and the work that had to go into
the hand in order to support the endurance learning experiments was definitely a
part of the story. There is a chance that new manufacturing techniques could really
disrupt this space -- videos like this one of FLLEX v2 look amazing[8] -- and I am
very optimistic that we'll have more capable and robust dexterous hands in the not-
so-distant future.
Figure 2.4 - This video of tele-operation with the PR1 from Ken Salisbury's group is
now a classic example of doing amazingly useful things with a very simple hand.
Check out their website for more videos, including sweeping, fetching a beer, and
unloading a dishwasher.
Another important argument in favor of simple hands is the elegance and clarity
that comes from reducing the complexity. If thinking clearly about simple grippers
helps us understand more deeply why we need more dexterous hands (I think it will),
then great. For most of these notes, a simple two-fingered gripper will serve our
pedagogical goals the best. In particular, I've selected the Schunk WSG 050, which
we have used extensively in our research over the last few years. We'll also explore a
few different end-effectors in later chapters, when they help to explain the concepts.
To be clear: just because a hand is simple (few degrees of freedom) does not
mean that it is low quality. On the contrary, the Schunk WSG is a very high-quality
gripper with force control and force measurement at its single degree of freedom
that surpasses the fidelity of the Kuka. It would be hard to achieve the same in a
dexterous hand with many joints.
Finally, the third and newest camp is promoting clever mechanical designs for
hands, which we'll call "underactuated hands", that can achieve some of the
advantages of a human hand based on mechanics alone. These hands are often
amazingly good at a particular range of tasks (most often "enveloping grasps"), but are not
as general purpose. It would be very hard to use one of these to, for instance, button
my shirt.
I particularly liked the response that Matt Mason, one of the main advocates for
simple grippers throughout the years, gave to a question at the end of one of our
robotics seminars: he argued that useful robots in e.g. the kitchen will probably have
special purpose tools that can be changed quickly.
One time I was attending an event where the registration form asked us "what is
your favorite robot of all time, real or fictional". That is a tough question for someone
who loves robots! But the answer I gave was a super cool "high-speed multifingered
hand" by the Ishikawa group; a project that started turning out amazing results back
in 2004! They "overclocked" the hand -- sending more current for short durations
than would be reasonable for any longer applications -- and also used high-speed
cameras to achieve these results. And they had a Rubik's cube demo, too, in 2017.
So good!
2.3 SENSORS
I haven't said much yet about sensors. In fact, sensors are going to be a major topic
for us when we get to perception with (depth) cameras, and when we think about
tactile sensing. But I will defer those topics until we need them.
For now, let us focus on the joint sensors on the robot. Both the iiwa and the Schunk
WSG provide joint feedback -- the iiwa driver gives "measured position", "estimated
velocity", and "measured torque" at each of its seven joints; remember that joint
accelerations are typically considered too noisy to rely on. Similarly the Schunk WSG
outputs "measured state" (position + velocity) and "measured force". We can make
all of these available as ports in a block diagram.
2.4 PUTTING IT ALL TOGETHER
If you've worked through the examples, you've seen that a proper simulation of our
robot is more than just a physics engine -- it requires assembling physics, actuator
and sensor models, and low-level robot controllers into a common framework. In
practice, in Drake, that means that we are assembling increasingly sophisticated
block diagrams.
One of the best things about the block-diagram modeling paradigm is the power
of abstraction and encapsulation. We can assemble a Diagram that contains all of
the components necessary to simulate our hardware platform and its environment,
which we will refer to affectionately as the "Manipulation Station". This diagram
itself can then be used as a System in additional diagrams, which can include
our perception, planning, and higher-level control systems. All together, the
ManipulationStation system looks like this:
The ManipulationStation system has input ports iiwa_position, iiwa_feedforward_torque (optional), wsg_position, and wsg_force_limit (optional), and output ports iiwa_position_commanded, iiwa_position_measured, iiwa_velocity_estimated, iiwa_state_estimated, iiwa_torque_commanded, iiwa_torque_measured, iiwa_torque_external, wsg_state_measured, wsg_force_measured, camera_[NAME]_rgb_image, camera_[NAME]_depth_image, and camera_[NAME]_label_image (for each camera), plus query_object, contact_results, plant_continuous_state, and geometry_poses.
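For orientation (and for the exercises below), here is a minimal sketch of constructing the simulated station in pydrake and probing one of its ports. The module path, the SetupManipulationClassStation() method name, and the fixed input values are my best understanding of the pydrake API at the time of writing; double-check them against the current Drake documentation.

import numpy as np
from pydrake.examples.manipulation_station import ManipulationStation

station = ManipulationStation()
station.SetupManipulationClassStation()   # adds the iiwa, WSG, cameras, and table
station.Finalize()

context = station.CreateDefaultContext()
# Input ports must be fixed (or connected) before evaluating outputs.
station.GetInputPort("iiwa_position").FixValue(context, np.zeros(7))
station.GetInputPort("wsg_position").FixValue(context, np.array([0.1]))
print(station.GetOutputPort("iiwa_position_measured").Eval(context))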
This model also defines the abstraction between the simulation and the real
hardware. We offer an almost identical system, the
ManipulationStationHardwareInterface. If you use it directly in place of the
ManipulationStation, then the same code you've developed in simulation can be
run directly on the real robot. The ports that are available only in simulation, but not
in reality, are colored orange on the ManipulationStation system.
The ManipulationStationHardwareInterface system has input ports iiwa_position, iiwa_feedforward_torque, wsg_position, and wsg_force_limit (optional), and output ports iiwa_position_commanded, iiwa_position_measured, iiwa_velocity_estimated, iiwa_torque_commanded, iiwa_torque_measured, iiwa_torque_external, wsg_state_measured, wsg_force_measured, camera_[NAME]_rgb_image, and camera_[NAME]_depth_image (for each camera).
If you do have your own similar robot hardware available, and want to run the
hardware interface on your machines, I've started putting together a list of drivers
and bill of materials in the appendix.
2.5 EXERCISES
a. Learn how to probe the input and output ports of the manipulation
station and evaluate their contents.
b. Explore what different ports correspond to by probing their values.
REFERENCES
[1] Haruhiko Asada and Kamal Youcef-Toumi. Direct-Drive Robots - Theory and
Practice. The MIT Press, 1987.
[2] P. M. Wensing, A. Wang, S. Seok, D. Otten, J. Lang, and S. Kim. Proprioceptive
actuator design in the mit cheetah: Impact mitigation and high-bandwidth
physical interaction for dynamic legged robots. IEEE Transactions on Robotics,
33(3):509--522, June 2017.
[3] Navvab Kashiri, Jörn Malzahn, and Nikos Tsagarakis. On the sensor design of
torque controlled actuators: A comparison study of strain gauge and encoder
based principles. IEEE Robotics and Automation Letters, PP, 02 2017.
[4] Armin Wedler, Maxime Chalon, Klaus Landzettel, Martin Görner, Erich Krämer,
Robin Gruber, Alexander Beyer, Hans-Jürgen Sedlmayr, Bertram Willberg,
Wieland Bertleff, Josef Reill, Markus Schedl, Alin Albu-Schaffer, and Gerd
Hirzinger. DLR's dynamic actuator modules for robotic space applications. In
Proceedings of the 41st Aerospace Mechanisms Symposium, May 16-18 2012.
[5] G. A. Pratt and M. M. Williamson. Series elastic actuators. In Proceedings 1995
IEEE/RSJ International Conference on Intelligent Robots and Systems. Human
Robot Interaction and Cooperative Robots, volume 1, pages 399--406 vol.1, Aug
1995.
[6] Alin Albu-Schaffer, Christian Ott, and Gerd Hirzinger. A unified passivity-based
control framework for position, torque and impedance control of flexible joint
robots. The international journal of robotics research, 26(1):23--39, 2007.
[7] Tao Pang and Russ Tedrake. A robust time-stepping scheme for quasistatic rigid
multibody systems. In IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), 2018.
[8] Yong-Jae Kim, Junsuk Yoon, and Young-Woo Sim. Fluid lubricated dexterous
finger mechanism for human-like impact absorbing capability. IEEE Robotics and
Automation Letters, 4(4):3971--3978, 2019.
CHAPTER 3
Basic Pick and Place
Figure 3.1 - Your challenge: command the robot to pick up the brick and place it in
a desired position/orientation.
The stage is set. You have your robot. I have a little red foam brick. I'm going to
put it on the table in front of your robot, and your goal is to move it to a desired
position/orientation on the table. I want to defer the perception problem for one
chapter, and will let you assume that you have access to a perfect measurement of
the current position/orientation of the brick. Even without perception, completing
this task requires us to build up a basic toolkit for geometry and kinematics; it's a
natural place to start.
First, we will establish some terminology and notation for kinematics. This is one
area where careful notation can yield dividends, and sloppy notation will inevitably
lead to confusion and bugs. The Drake developers have gone to great lengths to
establish and document a consistent multibody notation, which we call "Monogram
Notation". The documentation even includes some of the motivation/philosophy
behind that notation. I'll use the monogram notation throughout this text.
If you'd like a more extensive background on kinematics than what I provide here,
my favorite reference is still [1]. For free online resources, Chapters 2 and 3 of the
1994 book by Murray et al. (now free online)[2] are also excellent, as are the first
seven chapters of Modern Robotics by Lynch and Park[3] (they also have excellent
accompanying videos). Unfortunately, with three different references you'll get three
(slightly) different notations; ours is most similar to [0].
Please don't get overwhelmed by how much background material there is to know!
I am personally of the opinion that a clear understanding of just a few basic ideas
should make you very effective here. The details will come later, if you need them.
Perhaps the most fundamental concept in geometry is the concept of a point. Points
occupy a position in space, and they can have names, e.g. point $A$, $C$, or more
descriptive names like $B_{cm}$ for the center of mass of body $B$. We'll denote the position
of the point by using a position vector $p^A$; that's $p$ for position, and not for point,
because other geometric quantities can also have a position.
But let's be more careful. Position is actually a relative quantity. Really, we should
only ever write the position of two points relative to each other. We'll use e.g. ${}^Ap^C$ to
denote the position of $C$ relative to $A$. The left superscript looks mighty strange, but
we'll see that it pays off once we start transforming points.
That is seriously heavy notation. I don't love it myself, but it's the most durable I've
got, and we'll have shorthand for when the context is clear.
There are a few very special frames. We use $W$ to denote the world frame. We think
about the world frame in Drake using vehicle coordinates (positive $x$ to the front,
positive $y$ to the left, and positive $z$ is up). The other particularly special frames are
the body frames: every body in the multibody system has a unique frame attached to
it. We'll typically use $B_i$ to denote the frame for body $i$.
Frames have a position, too -- it coincides with the frame origin. So it is perfectly
valid to write ${}^Wp^A_W$ to denote the position of point $A$ relative to the origin of the world
frame, expressed in the world frame. Here is where the shorthand comes in. If the
position of a quantity is relative to a frame, and expressed in the same frame, then
we can safely omit the subscript: ${}^Fp^A \equiv {}^Fp^A_F$. Furthermore, if the "relative to" field is
omitted, then we assume that the point is relative to $W$, so $p^A \equiv {}^Wp^A_W$.
Frames also have an orientation. We'll use $R$ to denote a rotation, and follow the
same notation, writing ${}^BR^A$ to denote the rotation of frame $A$ relative to frame $B$.
Unlike vectors, pure rotations do not have an additional "expressed in" frame.
The Drake documentation also discusses how to use this notation in code. In short,
${}^Bp^A_C$ is written p_BA_C. It works, I promise.
Now that we have the notation, we can formulate our approach to the basic pick and
place problem. Let us call our object, $O$, and our gripper, $G$. Our idealized perception
sensor tells us ${}^WX^O$. Let's create a frame $O_d$ to describe the "desired" pose of the
object, ${}^WX^{O_d}$. So pick and place manipulation is simply trying to make $X^O = X^{O_d}$.
To accomplish this, we will assume that the object doesn't move relative to the world
(${}^WX^O$ is constant) when the gripper is open, and the object doesn't move relative to
the gripper (${}^GX^O$ is constant) when the gripper is closed. Then we can:
• move the gripper in the world, $X^G$, to an appropriate pose relative to the object: ${}^OX^{G_{grasp}}$.
• close the gripper.
• move the gripper+object to the desired pose, $X^O = X^{O_d}$.
• open the gripper, and retract the hand.
To simplify the problem of approaching the object (without colliding with it) and
retracting from the object, we will insert a "pregrasp pose", ${}^OX^{G_{pregrasp}}$, above the
object as an intermediate step.
Clearly, programming this strategy requires good tools for working with these
transforms, and for relating the pose of the gripper to the joint angles of the robot.
Here are the basic rules of algebra for our spatial quantities:
• Positions expressed in the same frame can be added when their reference
and target symbols match:
$${}^Ap^B_F + {}^Bp^C_F = {}^Ap^C_F. \qquad (1)$$
Those should be pretty intuitive; make sure you confirm them for yourself.
• Multiplication by a rotation can be used to change the "expressed in" frame:
$${}^Ap^B_G = {}^GR^F\, {}^Ap^B_F. \qquad (3)$$
• Transforms compose:
$${}^AX^B\, {}^BX^C = {}^AX^C. \qquad (7)$$
In practice, transforms are implemented using homogeneous coordinates, but for now
I'm happy to leave that as an implementation detail.
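As a quick illustration of the composition rule in code, here is a sketch using pydrake's RigidTransform; the particular rotations and translations are arbitrary numbers chosen for the example.

from pydrake.math import RigidTransform, RollPitchYaw, RotationMatrix

# Arbitrary example poses: frame A in the world, and frame B in A.
X_WA = RigidTransform(RollPitchYaw(0, 0, 0.5), [1.0, 0.0, 0.0])
X_AB = RigidTransform(RotationMatrix.MakeZRotation(-0.5), [0.0, 0.2, 0.0])
# Transforms compose (equation (7)): X_WB = X_WA X_AB.
X_WB = X_WA.multiply(X_AB)
# The inverse swaps the superscripts: X_BW = (X_WB)^{-1}.
X_BW = X_WB.inverse()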
A depth camera returns points in the camera frame. Therefore, we'll write this
position of point $P_i$ with ${}^Cp^{P_i}$. If we want to convert the point into the world frame,
we simply have
$$p^{P_i} = X^C\, {}^Cp^{P_i}.$$
This is a work-horse operation for us. We often aim to merge points from multiple
cameras (typically in the world frame), and always need to somehow relate the
frames of the camera with the frames of the robot.
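In code, this work-horse operation looks something like the following sketch; the camera pose and the points are made up for illustration.

import numpy as np
from pydrake.math import RigidTransform, RollPitchYaw

# A made-up camera pose in the world, X_WC.
X_WC = RigidTransform(RollPitchYaw(-np.pi / 2, 0, 0), [0.5, 0.0, 1.0])
# Two points from the depth camera, expressed in the camera frame (3xN array).
p_CP = np.array([[0.1, -0.2],
                 [0.0,  0.1],
                 [0.8,  0.9]])
# Transform them into the world frame: p_WP = X_WC * p_CP.
p_WP = X_WC.multiply(p_CP)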
The spatial algebra gets us pretty close to what we need for our pick and place
algorithm. But remember that the interface we have with the robot reports
measured joint positions, and expects commands in the form of joint positions. So
our remaining task is to convert between joint angles and Cartesian frames. We'll do
this in steps; the first step is to go from joint positions to Cartesian frames: this is
known as forward kinematics.
Throughout this text, we will refer to the joint positions of the robot (also known
as "configuration" of the robot) using a vector q. If the configuration of the scene
includes objects in the environment as well as the robot, we would use q for
the entire configuration vector, and use e.g. qrobot for the subset of the vector
corresponding to the robot's joint positions. Therefore, the goal of forward
kinematics is to produce a map:
$$X^G = f^G_{kin}(q). \qquad (9)$$
Moreover, we'd like to have forward kinematics available for any frame we have
defined in the scene. Our spatial notation and spatial algebra make this
computation relatively straightforward.
Every Joint and "floating base" has some number of position variables associated
with it -- a subset of the configuration vector q -- and knows how to compute the
configuration dependent transform across the joint from the child joint frame JC
to the parent joint frame JP : JP X JC (q). Additionally, the kinematic tree defines the
(fixed) transforms from the joint frame to the child body frame, C X JC , and from the
joint frame to the parent frame, P X JP . Altogether, we can compute the configuration
transform between any one body and its parent,
P
X C (q) = P X JP JP X JC (q)JC X C .
You might be tempted to think that every time you add a joint to the
MultibodyPlant, you are adding a degree of freedom. But it actually works the other
way around. Every time you add a body to the plant, you are adding many degrees of
freedom. But you can then add joints to remove those degrees of freedom; joints are
constraints. "Welding" the robot's base to the world frame removes all of the floating
degrees of freedom of the base. Adding a rotational joint between a child body and a
parent body removes all but one degree of freedom, etc.
Interestingly, SceneGraph (the geometry engine) does not maintain any tree
structure. All that SceneGraph knows about is a list of frames specified relative to
the world frame, W , and a list of geometries attached to each frame.
In order to compute the pose of the gripper in the world, $X^G$, we simply query
the parent of the gripper frame in the kinematic tree, and recursively compose the
transforms until we get to the world frame.
Figure 3.3 - Kinematic frames on the iiwa (left) and the WSG (right). For each
frame, the positive x axis is in red, the positive y axis is in green, and the positive z
axis is in blue. It's (hopefully) easy to remember: XYZ ⇔ RGB.
Open in Colab
gripper = plant.GetBodyByName("body")
pose = plant.EvalBodyPoseInWorld(context, gripper)
Behind the scenes, the MultibodyPlant is doing all of the spatial algebra we
described above to return the pose (and also some clever caching because you can
reuse much of the computation when you want to evaluate the pose of another
frame on the same robot).
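If you want the pose of any other frame, or the relative pose between two frames, the same machinery applies. Here is a small sketch, reusing the plant and context from the snippet above (the frame names are those used in the standard Drake iiwa/WSG models):

# Pose of the gripper body frame in the world; equivalent to the call above.
gripper_frame = plant.GetBodyByName("body").body_frame()
X_WG = plant.CalcRelativeTransform(context, plant.world_frame(), gripper_frame)

# Relative pose between any two frames in the tree, e.g. the gripper relative
# to the iiwa base link.
X_BG = plant.CalcRelativeTransform(
    context, plant.GetFrameByName("iiwa_link_0"), gripper_frame)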
$$X^B = f^B_{\rm kin}(q),$$
This gets into the subtle points of how we represent transforms, and how we
represent rotations in particular. There are many possible representations of 3D
rotations, they are good for different things, and unfortunately, there is no one
representation to rule them all. (This is one of the many reasons why everything
is better in 2D!) Common representations include 3x3 rotation matrices, Euler
angles (e.g. Roll-Pitch-Yaw), axis angle, and unit quaternions. In Drake, we provide
all of these representations, and make it easy to convert back and forth between
them. In order to make the spatial algebra efficient, we use rotation matrices
in our RigidTransform, but in order to have a more compact representation of
configuration we use unit quaternions in our configuration vector, q. You can think
of unit quaternions as a form of axis angles that have been carefully normalized
to be unit length and have magical properties. My favorite careful description of
quaternions is probably chapter 1 of [4].
As a result, for this example, the software implementation of the function $f^B_{\rm kin}$ is precisely the function that converts the position × unit quaternion representation into the position × rotation matrix representation.
All of the subtlety, again, comes in because of the multiple representations that
we have for 3D rotations (rotation matrix, unit quaternions, ...). While there is
no one best representation for 3D rotations, it is possible to have one canonical
representation for differential rotations. Therefore, without any loss of generality,
we can represent the rate of change in pose using a six-component vector for spatial
velocity:
$${}^A V^B_C = \begin{bmatrix} {}^A\omega^B_C \\ {}^A v^B_C \end{bmatrix}. \qquad (11)$$
${}^A V^B_C$ is the spatial velocity of frame B relative to frame A expressed in frame C, where ${}^A\omega^B_C \in \mathbb{R}^3$ is the angular velocity (of frame B relative to frame A expressed in frame C) and ${}^A v^B_C \in \mathbb{R}^3$ is the translational velocity (with the same frame semantics as positions). Translational and rotational velocities are simply vectors, so they can be operated on just like positions:
• Velocities expressed in the same frame can be added, and have the additive inverse. For the translational velocity, the addition is perhaps known from high-school physics; for the angular velocity I consider it quite surprising, and it deserves to be verified.
• Rotation matrices can be used to change between the "expressed-in" frames
when those frames are fixed:
  $${}^A v^B_G = {}^G R^F \, {}^A v^B_F, \quad {}^A \omega^B_G = {}^G R^F \, {}^A \omega^B_F, \quad \text{when } {}^G\dot{R}^F = 0. \qquad (13)$$
• If the relative pose between two frames is not fixed, then we get slightly
more complicated formulas involving vector cross products; they are
derived from the derivative of the transform equation and the chain rule.
(Click the triangle to expand those details, but consider skipping them
for now!) The angular velocity vector is related to the time-derivative of a rotation matrix via the skew-symmetric matrix (as can be verified by differentiating $RR^T = I$):
$$\dot{R}R^T = \dot{R}R^{-1} = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix}, \qquad (14)$$
$$\dot{R}R^T p = \dot{R}R^{-1} p = \omega \times p. \qquad (15)$$
Therefore, differentiating
$${}^G p^A = {}^G X^F \, {}^F p^A = {}^G p^F + {}^G R^F \, {}^F p^A,$$
yields
$${}^G v^A = {}^G v^F + {}^G \dot{R}^F \, {}^F p^A + {}^G R^F \, {}^F v^A = {}^G v^F + {}^G \omega^F \times {}^G R^F \, {}^F p^A + {}^G R^F \, {}^F v^A. \qquad (16)$$
There is one more velocity to be aware of: I'll use v to denote the generalized
velocity vector of the plant. While a spatial velocity ${}^A V^B$ has six components, and a translational or angular velocity, ${}^B v^C$ or ${}^B \omega^C$, has three components, the generalized
velocity vector is whatever size it needs to be to encode the time derivatives of the
configuration variables, q. For the iiwa welded to the world frame, that means it has
seven components. I've tried to be careful to typeset each of these v's differently
throughout the notes. Almost always the distinction is also clear from the context.
But there is one minor nuisance that this causes. We tend to want to think of the
generalized velocity as the time derivative of the generalized positions. This works
when we have only our iiwa in the model, and it is welded to the world frame. But
we cannot assume this in general; not when floating-base rotations are concerned.
As evidence, here is a simple example that loads exactly one rigid body into the
MultibodyPlant, and then prints its Context.
Open in Colab
Context
--------
Time: 0
States:
13 continuous states
1 0 0 0 0 0 0 0 0 0 0 0 0
plant.num_positions() = 7
plant.num_velocities() = 6
You can see that this system has 13 total state variables. 7 of them are positions, q;
we use unit quaternions in the position vector. But we have just 6 velocities, v; we
use angular velocities in the velocity vector. Clearly, if the lengths of the vectors don't even match, we do not have $\dot{q} = v$.
It's not really any harder; Drake provides the MultibodyPlant methods
MapQDotToVelocity and MapVelocityToQDot to go back and forth between them.
But you have to remember to use them!
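As a small sketch (reusing, say, the single-rigid-body plant and context from the example above), the conversion back and forth looks like this:

import numpy as np

# A generalized velocity: angular velocity (3) followed by translational velocity (3).
v = np.array([0.1, 0.0, 0.0, 0.0, 0.0, 0.2])

# Map to the time derivative of the configuration (7 components here, because the
# configuration uses a unit quaternion), and back again.
qdot = plant.MapVelocityToQDot(context, v)
v_again = plant.MapQDotToVelocity(context, qdot)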
The MultibodyPlant class provides methods for computing these Jacobians:
• CalcJacobianAngularVelocity,
• CalcJacobianTranslationalVelocity, and
• CalcJacobianSpatialVelocity,
with each taking an argument to specify whether you'd like the Jacobian with respect to $\dot{q}$ or v. If you really like the analytical Jacobian, you could get it (much less efficiently) using our support for automatic differentiation.
Open in Colab
There is important structure in Eq (10). Make sure you didn't miss it. The
relationship between joint velocities and end-effector velocities is (configuration-
dependent) linear:
$$V^G = J^G(q)\, v. \qquad (17)$$
Maybe this gives us what we need to produce changes in gripper frame G? If I have
a desired gripper frame velocity $V^{G_d}$, then how about commanding a joint velocity $v = [J^G(q)]^{-1} V^{G_d}$?
Any time you write a matrix inverse, it's important to check that the matrix is
actually invertible. As a first sanity check: what are the dimensions of $J^G(q)$? We know the spatial velocity has six components. Our gripper frame is welded directly to the last link of the iiwa, and the iiwa has seven positions, so we have $J^G(q_{iiwa}) \in \mathbb{R}^{6\times 7}$. The matrix is not square, so it does not have an inverse. But having
more degrees of freedom than the desired spatial velocity requires (more columns
than rows) is actually the good case, in the sense that we might have many solutions
for v that can achieve a desired spatial velocity. To choose one of them (the minimum-
norm solution), we can consider using the Moore-Penrose pseudo-inverse, $J^+$, instead:
$$v = [J^G(q)]^+ V^{G_d}. \qquad (18)$$
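Here is a minimal sketch of one step of this pseudo-inverse controller (not the notebook's implementation); the desired spatial velocity below is a made-up number, and if your plant contains more than just the arm you would select the columns of the Jacobian corresponding to the iiwa joints:

import numpy as np
from pydrake.multibody.tree import JacobianWrtVariable

G = plant.GetBodyByName("body").body_frame()
W = plant.world_frame()

# 6 x num_velocities spatial-velocity Jacobian of frame G, measured in and
# expressed in the world frame, taken with respect to the generalized velocity v.
J_G = plant.CalcJacobianSpatialVelocity(
    context, JacobianWrtVariable.kV, G, np.zeros(3), W, W)

# Desired spatial velocity [angular; translational] of G in W.
V_Gd = np.array([0.0, 0.0, 0.0, 0.1, 0.0, 0.05])

# Minimum-norm joint velocity command, Eq. (18).
v_command = np.linalg.pinv(J_G) @ V_Gd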
To keep things simple for this first example, we'll just command a constant gripper
spatial velocity, and only run the simulation for a few seconds.
Note that we do have to add one additional system into the diagram. The output
of our controller is a desired joint velocity, but the input that the iiwa controller is
expecting is a desired joint position. So we will insert an integrator in between.
I don't expect you to understand every line in this example, but it's worth finding
the important lines and making sure you can change them and see what happens!
Open in Colab
There is a simple check to understand when the pseudo-inverse can give an exact
solution for any spatial velocity (achieving exactly the desired spatial velocity): the
Jacobian must be full row-rank. In this case, we need rank(J) = 6. But assigning
an integer rank to a matrix doesn't tell the entire story; for a real robot with
(noisy) floating-point joint positions, as the matrix gets close to losing rank, (pseudo-
)inverse starts to "blow-up". A better metric, then, is to watch the smallest singular
value; as this approaches zero, the norm of the pseudo-inverse will approach infinity.
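A quick sketch of that check, assuming J_G was computed as in the snippet above:

# The smallest singular value measures how close J_G is to losing rank; as it
# approaches zero, the norm of the pseudo-inverse blows up.
sigma = np.linalg.svd(J_G, compute_uv=False)
print("smallest singular value:", sigma.min())
print("2-norm of pseudo-inverse:", np.linalg.norm(np.linalg.pinv(J_G), ord=2))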
Open in Colab
Here's a hint: try some configurations where the arm is very straight (e.g., driving joints 2 and 4 close to zero).
Another good way to find the singularities is to use your pseudo-inverse
controller to send gripper spatial velocity commands that drive the gripper to the
limits of the robot's workspace. Try it and see! In fact, this is the common case,
and one that we will work hard to avoid.
Configurations q for which rank(J(qiiwa )) < 6 for a frame of interest (like our gripper
frame G) are called kinematic singularities. Try to avoid going near them if you can!
The iiwa has many virtues, but admittedly its kinematic workspace is not one of
them. Trust me, if you try to get a big Kuka to reach into a little kitchen sink all day,
every day, then you will spend a non-trivial amount of time thinking about avoiding
singularities.
In practice, things can get a lot better if we stop bolting our robot base to a fixed
location in the world. Mobile bases add complexity, but they are wonderful for
improving the kinematic workspace of a robot.
Example 3.11 (Are kinematic singularities real?)
A natural question when discussing singularities is whether they are somehow
real, or whether they are somehow an artifact of the analysis. Perhaps it is useful
to look at an extremely simple case.
Imagine a two-link arm, where each link has length one. Then the kinematics reduce to
$$p^G = \begin{bmatrix} c_0 + c_{0+1} \\ s_0 + s_{0+1} \end{bmatrix},$$
where I've used the (very common) shorthand $s_0$ for $\sin(q_0)$ and $s_{0+1}$ for $\sin(q_0 + q_1)$, etc. The Jacobian is
$$J(q) = \begin{bmatrix} -s_0 - s_{0+1} & -s_{0+1} \\ c_0 + c_{0+1} & c_{0+1} \end{bmatrix},$$
and as expected, it loses rank when the arm is at full extension (e.g. when $q_0 = q_1 = 0$, which implies the first row is zero).
Let's move the robot along the x-axis, by taking $q_0(t) = 1 - t$ and $q_1(t) = -2 + 2t$. This clearly visits the singularity $q_0 = q_1 = 0$ at time 1, and then leaves again without trouble. In fact, it does all this with a constant joint velocity ($\dot{q}_0 = -1$, $\dot{q}_1 = 2$)! The resulting trajectory is shown in the animation in the online notes.
There are a few things to understand. At the singularity, there is nothing that
the robot can do to move its gripper farther in positive x -- that singularity is
real. But it is also true that there is no way for the robot to move instantaneously
back in the direction of −x. The Jacobian analysis is not an approximation, it
is a perfect description of the relationship between joint velocities and gripper
velocities. However, just because you cannot achieve an instantaneous velocity in
the backwards direction, it does not mean you cannot get there! At t = 1, even
though the joint velocities are constant, and the Jacobian is singular, the robot is
accelerating in $-x$: $\ddot{p}^G_{W_x}(t) = -2\cos(1 - t)$.
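A quick numerical check of this example (plain numpy, not from the companion code):

import numpy as np

def fk(q0, q1):
    # Forward kinematics of the planar two-link arm with unit link lengths.
    return np.array([np.cos(q0) + np.cos(q0 + q1),
                     np.sin(q0) + np.sin(q0 + q1)])

def jacobian(q0, q1):
    return np.array([[-np.sin(q0) - np.sin(q0 + q1), -np.sin(q0 + q1)],
                     [ np.cos(q0) + np.cos(q0 + q1),  np.cos(q0 + q1)]])

for t in [0.5, 1.0, 1.5]:
    q0, q1 = 1 - t, -2 + 2 * t
    print(f"t={t}: p_G={fk(q0, q1)}, rank(J)={np.linalg.matrix_rank(jacobian(q0, q1))}")

# At t=1 the rank drops to 1, yet p_Gx(t) = 2 cos(1-t) keeps moving (and
# accelerating) right through the singularity.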
I'm going to put my red foam brick on the table. Its geometry is defined as a 7.5cm
x 5cm x 5cm box. For reference, the distance between the fingers on our gripper in
the default "open" position is 10.7cm. The "palm" of the gripper is 3.625cm from the
body origin, and the fingers are 8.2cm long.
To make things easy to start, I'll promise to set the object down on the table with the
object frame's z-axis pointing up (aligned with the world z axis), and you can assume
it is resting on the table safely within the workspace of the robot. But I reserve the
right to give the object arbitrary initial yaw. Don't worry, you might have noticed that
the seventh joint of the iiwa will let you rotate your gripper around quite nicely (well
beyond what my human wrist can do).
Observe that visually the box has rotational symmetry -- I could always rotate the
box 90 degrees around its x-axis and you wouldn't be able to tell the difference.
We'll think about the consequences of that more in the next chapter when we start
using perception. But for now, we are ok using the omniscient "cheat port" from the
simulator which gives us the unambiguous pose of the brick.
Figure 3.5 - The gripper frame and the object frame. For each frame, the positive x
axis is in red, the positive y axis is in green, and the positive z axis is in blue (XYZ ⇔
RGB).
Take a careful look at the gripper frame in the figure above, using the colors to
understand the axes. Here is my thinking: Given the size of the hand and the object,
I want the desired position (in meters) of the object in the gripper frame,
$${}^{G_{grasp}} p^O = \begin{bmatrix} 0 \\ 0.12 \\ 0 \end{bmatrix}, \qquad {}^{G_{pregrasp}} p^O = \begin{bmatrix} 0 \\ 0.2 \\ 0 \end{bmatrix}.$$
Recall that the logic behind a pregrasp pose is to first move safely above the object: if the only gripper motion that happens very close to the object is a straight translation from the pregrasp pose to the grasp pose and back, then we can mostly avoid having to think about collisions (for now). I want the orientation of my
gripper to be set so that the positive z axis of the object aligns with the negative y
axis of the gripper frame, and the positive x axis of the object aligns with the positive
z axis of the gripper. We can accomplish that with
$${}^{G_{grasp}} R^O = \text{MakeXRotation}\!\left(\tfrac{\pi}{2}\right) \text{MakeZRotation}\!\left(\tfrac{\pi}{2}\right).$$
I admit I had my right hand in the air for that one! Our pregrasp pose will have the
same orientation as our grasp pose.
Open in Colab
I hope you can see the value of having good notation at work here. My right hand
was in the air when I was deciding what a suitable relative pose for the object in
the hand should be (when writing the notes). But once that was decided, I went to
type it in and everything just worked.
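As a sketch of what "typing it in" might look like (the variable names are illustrative, not necessarily those used in the notebook):

import numpy as np
from pydrake.math import RigidTransform, RotationMatrix

# Desired pose of the object O in the gripper's grasp and pregrasp frames.
R_GgraspO = RotationMatrix.MakeXRotation(np.pi / 2).multiply(
    RotationMatrix.MakeZRotation(np.pi / 2))
X_GgraspO = RigidTransform(R_GgraspO, [0, 0.12, 0])
X_GpregraspO = RigidTransform(R_GgraspO, [0, 0.2, 0])

# Given the (cheat-port) pose of the object in the world, X_WO, the desired
# gripper poses follow directly from the spatial algebra:
#   X_WGgrasp    = X_WO.multiply(X_GgraspO.inverse())
#   X_WGpregrasp = X_WO.multiply(X_GpregraspO.inverse())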
We're getting close. We know how to produce desired gripper poses, and we know
how to change the gripper pose instantaneously using spatial velocity commands.
Now we need to specify how we want the gripper poses to change over time, so that
we can convert our gripper poses into spatial velocity commands.
Let's define all of the "keyframe" poses that we'd like the gripper to travel through,
and time that it should visit each one. The following colab example does precisely
that.
Figure 3.7 - Keyframes of the gripper. The robot's base will be at the origin, so
we're looking over the (invisible) robot's shoulder here. The hand starts in the
"initial" pose near the center, moves to the "prepick" to "pick" to "prepick" to
"clearance" to "preplace" to "place" and finally back to "preplace".
How did I choose the times? I started everything at time t = 0, and listed the rest
of our times as absolute (time from zero). That's when the robot wakes up and
sees the brick. How long should we take to transition from the starting pose to the
pregrasp pose? A really good answer might depend on the exact joint speed limits
of the robot, but we're not trying to move fast yet. Instead I chose a conservative
time that is proportional to the total Euclidean distance that the hand will travel,
say $k = 10$ s/m (aka 10 cm/s):
$$t_{pregrasp} = k \, \| {}^{G_0} p^{G_{pregrasp}} \|_2.$$
I just chose a fixed duration of 2 seconds for the transitions from pregrasp to grasp
and back, and also left 2 seconds with the gripper stationary for the segments
where the fingers needs to open/close.
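A small sketch of this timing rule (the keyframe pose variables are assumed to be RigidTransforms computed as above):

import numpy as np

k = 10.0  # seconds per meter of hand travel, i.e. 10 cm/s.

times = {"initial": 0.0}
times["prepick"] = times["initial"] + k * np.linalg.norm(
    X_WGprepick.translation() - X_WGinitial.translation())
# Fixed 2-second segments for the prepick <-> pick moves, and 2 seconds with the
# gripper stationary while the fingers open or close.
times["pick_start"] = times["prepick"] + 2.0
times["pick_end"] = times["pick_start"] + 2.0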
To fill in the details -- defining a gripper pose for all t -- we'll convert these keyframes into a proper trajectory, $X^G(t), \forall t \in [t_{initial}, t_{postpick}]$. Actually, we will produce a handful of trajectories: one for the gripper position, one for the gripper orientation, and one for the gripper (finger) command.
Open in Colab
Figure: The x, y, and z components of the gripper position trajectory, plotted against time t (from roughly 0 to 35 seconds).
With 3D data, you can plot it in 3D. But my favorite approach is as an animation in
our 3D renderer! Make sure you try the "Open controls>Animation" interface. You
can pause it and then scrub through the trajectory using the time slider.
System diagram: the PseudoInverseController takes omega_WG, v_WG, and iiwa_position as inputs and produces iiwa_velocity_command as its output.
So far, our trajectories are position and orientation trajectories. Our controller needs
translational velocity and angular velocity commands. Fortunately, the trajectory
classes we have used support differentiating the trajectories. In fact, the
PiecewiseQuaternionSlerp is clever enough to return the derivative of the
4-component quaternion trajectory as a 3-component angular velocity trajectory. The
rest is just a matter of wiring up the system diagram.
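Here is a sketch of building and differentiating those trajectories from the keyframes (the keyframe lists are assumptions, and the exact class usage may differ slightly from the notebook and across Drake versions):

import numpy as np
from pydrake.trajectories import PiecewisePolynomial, PiecewiseQuaternionSlerp

# times: list of keyframe times; X_WG_keyframes: list of RigidTransforms.
positions = np.column_stack([X.translation() for X in X_WG_keyframes])
rotations = [X.rotation().ToQuaternion() for X in X_WG_keyframes]

traj_p_WG = PiecewisePolynomial.FirstOrderHold(times, positions)
traj_R_WG = PiecewiseQuaternionSlerp(times, rotations)

# Differentiate to get the controller's commands: a 3-component translational
# velocity trajectory and a 3-component angular velocity trajectory.
traj_v_WG = traj_p_WG.MakeDerivative()
traj_w_WG = traj_R_WG.MakeDerivative()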
Open in Colab
System diagram for the full pick-and-place example: the trajectory sources (labeled pick_and_place) provide omega_WG, v_WG, and wsg_position; the PseudoInverseController consumes omega_WG, v_WG, and iiwa_position and outputs iiwa_velocity, which passes through an integrator into the manipulation_station's iiwa_position input, while wsg_position drives the station's wsg_command input. The station exposes its many other ports (iiwa_position_commanded, iiwa_position_measured, iiwa_velocity_estimated, iiwa_state_estimated, iiwa_torque_commanded, iiwa_torque_measured, iiwa_feedforward_torque, iiwa_torque_external, wsg_state_measured, wsg_force_measured, wsg_force_limit, camera_0_rgb_image, camera_0_depth_image, camera_0_label_image, query_object, contact_results, plant_continuous_state, geometry_poses), with the meshcat_visualizer and lcm_visualization connected for visualization.
It's worth scrutinizing the result. For instance, if you examine the context at the
final time, how close did it come to the desired final gripper position? Are you
happy with the joint positions? If you ran that same trajectory in reverse, then back
and forth (as an industrial robot might), would you expect errors to accumulate?
Our solution above works in many cases. We could potentially move on. But with just
a little more work, we can get a much more robust solution... one that we will be
happy with for many chapters to come.
So what's wrong with the pseudo-inverse controller? You won't be surprised when I
say that it does not perform well around singularities. When the minimum singular
value of the Jacobian gets small, that means that some values in the inverse get very
large. If you ask our controller to track a seemingly reasonable end-effector spatial
velocity, then you might have extremely large velocity commands that result.
There are other important limitations, though, which are perhaps more subtle.
The real robot has constraints, very real constraints on the joint angles, velocities,
accelerations, and torques. If you, for instance, send a velocity command to the iiwa
controller that cannot be followed, that velocity will be clipped. In the mode we
are running the iiwa (joint-impedance mode), the iiwa doesn't know anything about
your end-effector goals. So it will very likely simply saturate your velocity commands
independently joint by joint. The result, I'm afraid, will not be as convenient as a
slower end-effector trajectory. Your end-effector could run wildly off course.
Since we know the limits of the iiwa, a better approach is to take these constraints
into account at the level of our controller. It's relatively straight-forward to take
position, velocity, and acceleration constraints into account; torques would require
a full dynamics model so we won't worry about them yet here.
When I write an optimization of this form, e.g.
$$\min_v \| J^G(q)\, v - V^{G_d} \|^2_2,$$
I will refer to v as the decision variable(s), and I will use $v^*$ to denote the optimal solution (the value of the decision variables that minimizes the cost). Here we have
$$v^* = [J^G(q)]^+ V^{G_d}.$$
Optimization is an incredibly rich topic, and we will put many tools from optimization
theory to use over the course of this text. For a beautiful rigorous but accessible
treatment of convex optimization, I highly recommend [5]; it is free online and even
reading the first chapter can be incredibly valuable. For a very short introduction
to using optimization in Drake, please take a look at the tutorials on "Solving
Mathematical Programs" linked from the Drake front page. I use the term
"mathematical program" almost synonymously with "optimization problem".
Mathematical program is slightly more appropriate if we don't actually have an
objective; only constraints.
You can read this as "find me the joint velocities that achieve my desired gripper
spatial velocity as closely as possible, but satisfy my joint velocity constraints."
The solution to this can be much better than what you would get from solving the
unconstrained optimization and then simply trimming any velocities to respect the
constraints after the fact.
This is, admittedly, a harder problem to solve in general. The solution cannot be
described using only the pseudo-inverse of the Jacobian. Rather, we are going to
solve this (small) optimization problem directly in our controller every time it is
evaluated. This problem has a convex quadratic objective and linear constraints, so
it falls into the class of convex Quadratic Programming (QP). This is a particularly
nice class of optimization problems where we have very strong numerical tools.
We can easily add more constraints to our QP, without significantly increasing the
complexity, as long as they are linear in the decision variables. So how should we
add constraints on the joint position and acceleration?
The controller already accepts the current measured joint positions q as an input;
let us now also take the current measured joint velocities v as a second input. And
we'll use vn for our decision variable -- the next velocity to command. Using a simple
Euler approximation of position and first-order derivative for acceleration gives us
the following optimization problem:
$$\min_{v_n} \ \| J^G(q)\, v_n - V^{G_d} \|^2_2, \qquad (21)$$
$$\text{subject to} \quad v_{min} \le v_n \le v_{max},$$
$$q_{min} \le q + h\, v_n \le q_{max},$$
$$\dot{v}_{min} \le \frac{v_n - v}{h} \le \dot{v}_{max}.$$
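Here is a minimal sketch of that QP using Drake's MathematicalProgram (the limit vectors and the timestep h are placeholders, the import path may differ across Drake versions, and this is not Drake's built-in differential inverse kinematics implementation):

import numpy as np
from pydrake.solvers import MathematicalProgram, Solve

def diff_ik_qp(J_G, V_Gd, q, v, h, q_min, q_max, v_min, v_max, vd_min, vd_max):
    prog = MathematicalProgram()
    v_n = prog.NewContinuousVariables(J_G.shape[1], "v_n")

    # Least-squares tracking of the desired gripper spatial velocity, Eq. (21).
    error = J_G @ v_n - V_Gd
    prog.AddQuadraticCost(error.dot(error))

    # Velocity, position, and acceleration limits; all are simple bounds on v_n.
    prog.AddBoundingBoxConstraint(v_min, v_max, v_n)
    prog.AddBoundingBoxConstraint((q_min - q) / h, (q_max - q) / h, v_n)
    prog.AddBoundingBoxConstraint(v + h * vd_min, v + h * vd_max, v_n)

    result = Solve(prog)
    return result.GetSolution(v_n)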
Here is one way to formulate it. If we had no end-effector goals, then we might write
the controller v = K(q0 − q), a proportional control that would drive the robot to its
nominal configuration. In the unconstrained control case, there is a very elegant way
of implementing this controller as a "secondary" control, guaranteed not to interfere
with our primary end-effector spatial velocity controller, by projecting it into the
nullspace of J G . We can use almost the same trick here.
There are more sophisticated methods if one wishes to establish a strict task prioritization in the presence of constraints (e.g. [6, 7]), but for the simple prioritization we have formulated here, the penalty method is quite reasonable.
Once we have embraced the idea of solving a small optimization problem in our
control loop, many other formulations are possible, too. You will find many in the
literature. Minimizing the least-squares distance between the commanded spatial
velocity and the resulting velocity might not actually be the best solution. The
formulation we have been using heavily in Drake adds an additional constraint that
our solution must move in the same direction as the commanded spatial velocity.
If we are up against constraints, then we may slow down, but we will not deviate
(instantaneously) from the commanded path. It would be a pity to spend a long
time carefully planning collision-free paths for your end-effector, only to have your
controller treat the path as merely a suggestion. Note, however, that your plan
playback system still needs to be smart enough to realize that the slow-down
occurred (open-loop velocity commands are not enough).
$$\max_{v_n, \alpha} \ \alpha, \qquad (23)$$
$$\text{subject to} \quad J^G(q)\, v_n = \alpha V^{G_d},$$
$$0 \le \alpha \le 1,$$
plus the additional constraints above.
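A sketch of that variation, in the same style as the QP above (again illustrative, not Drake's shipped implementation):

from pydrake.solvers import MathematicalProgram, Solve

def diff_ik_scaled(J_G, V_Gd, v_min, v_max):
    prog = MathematicalProgram()
    v_n = prog.NewContinuousVariables(J_G.shape[1], "v_n")
    alpha = prog.NewContinuousVariables(1, "alpha")[0]

    # Move as fast as possible along the commanded direction...
    prog.AddLinearCost(-alpha)
    # ...but only along that direction: J v_n must be a scaling of V_Gd.
    for i in range(J_G.shape[0]):
        prog.AddLinearConstraint(J_G[i, :].dot(v_n) == alpha * V_Gd[i])
    prog.AddBoundingBoxConstraint(0, 1, alpha)
    prog.AddBoundingBoxConstraint(v_min, v_max, v_n)

    result = Solve(prog)
    return result.GetSolution(v_n)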
I've rendered the gripper and the brick with their corresponding body frames.
Given the configuration displayed in the figure, which is a possible value for G pO ?
a. [0.2, 0, -.2]
b. [0, 0.3, .1]
c. [0, -.3, .1]
figures/scaling_spatial_velocity.py
I think you'll find that the solution is quite close to $x^* = 5$, but also that $y^*$ is quite different from the $y^* = 0$ one would obtain if the "secondary" objective were omitted completely.
Note that if the constraint were not active at the optimal solution (e.g. if x = 5, y = −3 satisfied the constraints), then the objectives would not compete.
This seems like an overly simple example. But I think you'll find that it is actually
quite similar to what is happening in the nullspace objective formulation above.
The approach that we'll take here is very geometric. This is in contrast to, and very complementary with, approaches that are more fundamentally driven by data. There
is no question that deep learning has had an enormous impact in perception -- it's
fair to say that it has enabled the current surge in manipulation advances -- and we
will certainly cover it in these notes. But when I've heard colleagues say that "all
perception these days is based on deep learning", I can't help but cry foul. There
has been another surge of progress in the last few years that has been happening in
parallel: the revolution in geometric reasoning, fueled by applications in autonomous
driving and virtual/augmented reality in addition to manipulation. Most advanced
manipulation systems these days combine both "deep perception" and "geometric
perception".
Just as we had many choices when selecting the robot arm and hand, we have many
choices for instrumenting our robot/environment with sensors. Even more so than
our robot arms, the last few years have seen incredible improvements in the quality
and reductions in cost and size for these sensors. This is largely thanks to the cell
phone industry, but the race for autonomous cars has been fueling high-end sensors
as well.
One might think that the most important sensors for manipulation are the touch
sensors (you might even be right!). But in practice today, most of the emphasis is
on camera-based and/or range sensing. At very least, we should consider this first,
since our touch sensors won't do us much good if we don't know where in the world
we need to touch.
Primarily RGB-D (ToF vs projected texture stereo vs ...) cameras and Lidar
The cameras we are using in this course are Intel RealSense D415....
Monocular depth.
4.1.2 Simulation
There are a number of levels of fidelity at which one can simulate a camera like the
D415. We'll start our discussion here using an "ideal" RGB-D camera simulation --
the pixels returned in the depth image represent the true geometric depth in the
direction of each pixel coordinate. In Drake that is represented by the RgbdSensor
system, which can be wired up directly to the SceneGraph.
System diagram: the RgbdSensor takes a geometry_query input and produces color_image, depth_image_32f, depth_image_16u, label_image, and X_WB outputs.
In the ManipulationStation simulation of the entire robot system for class, the
RgbdSensors have already been added, and their output ports exposed as outputs
for the station.
Real depth sensors, of course, are far from ideal -- and errors in depth returns
are not simple Gaussian noise, but rather are dependent on the lighting conditions,
the surface normals of the object, and the visual material properties of the object,
among other things. The color channels are an approximation, too. Even our high-
end renderers can only do as well as the geometries, materials and lighting sources
that we add to our scene, and it is very hard to capture all of the nuances of a
real environment. We will examine real sensor data, and the gaps between modeled
sensors and real sensors, after we start understanding some of the basic operations.
Figure 4.2 - Real depth sensors are messy. The red pixels in the far right image are
missed returns -- the depth sensor returned "maximum depth". Missed returns,
especially on the edges of objects (or on reflective objects) are a common sighting
in raw depth data. This figure is reproduced from [1].
The view of the mustard bottle in the example above makes one primary challenge
of working with cameras very clear. Cameras don't get to see everything! They only
get line of sight. If you want to get points from the back of the mustard bottle, you'll
need to move the camera. And if the mustard bottle is sitting on the table, you'll have
to pick it up if you ever want to see the bottom. This situation gets even worse in
cluttered scenes, where we have our views of the bottle occluded by other objects.
Even when the scene is not cluttered, a head-mounted or table-mounted camera is
often blocked by the robot's hand when it goes to do manipulation -- the moments when you would like the sensors to be at their best are often exactly when they give no information!
It is quite common to mount a depth camera on the robot's wrist in order to help
with the partial views. But it doesn't completely resolve the problem. All of our depth
sensors have both a maximum range and a minimum range. The RealSense D415
returns matches from about 0.3m to 1.5m. That means for the last 30cm of the
approach to the object -- again, where we might like our cameras to be at their best
-- the object effectively disappears!
Figure 4.3 - When luxury permits, we try to put as many cameras as we can into the
robot's environment. In this dish-loading testbed at TRI, we put a total of eight cameras into the scene (it still doesn't resolve the occlusion problem).
Depth image, point cloud, triangulated mesh, signed distance functions, (+ surfels,
occupancy maps, ...)
The data returned by a depth camera takes the form of an image, where each
pixel value is a single number that represents the distance between the camera
and the nearest object in the environment along the pixel direction. If we combine
this with the basic information about the camera's intrinsic parameters (e.g. lens
parameters, stored in the CameraInfo class in Drake) then we can transform this
depth image representation into a collection of 3D points, si . I use s here because
they are commonly referred to as the "scene points" in the algorithms we will
present below. These points have a pose and (optionally) a color value or other
information attached. This is known as a point cloud representation. By default,
their pose is described in the camera frame, C X si , but the DepthImageToPointCloud
system in Drake also accepts P X C to make it easy to transform them into another
frame (such as the world frame).
System diagram: the DepthImageToPointCloud system takes depth_image (and, optionally, color_image and camera_pose) as inputs and outputs point_cloud.
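To make the conversion concrete, here is a small numpy sketch of the pinhole back-projection (the intrinsics values are placeholders; in Drake the same information lives in CameraInfo and the conversion is done for you by DepthImageToPointCloud):

import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    # Back-project an (H, W) depth image into an (N, 3) point cloud in the
    # camera frame, dropping invalid (non-finite or non-positive) returns.
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    valid = np.isfinite(points[:, 2]) & (points[:, 2] > 0)
    return points[valid]

# Example call with made-up, roughly D415-like intrinsics:
# p_C = depth_to_points(depth_image, fx=616.0, fy=616.0, cx=320.0, cy=240.0)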
As depth sensors have become so pervasive the field has built up libraries of tools
for performing basic geometric operations on point clouds, and that can be used to
transform back and forth between representations. We can implement many of
the basic operations directly in Drake, but we will use the Open3D library if we need
more. Many older projects used the Point Cloud Library (PCL), which is now defunct
but still has some very useful documentation.
It's important to realize that the conversion of a depth image into a point cloud does
lose some information -- specifically the information about the ray that was cast from
the camera to arrive at that point. In addition to declaring "there is geometry at
this point", the depth image combined with the camera pose also implies that "there
is no geometry in the straight line path between camera and this point". We will
make use of this information in some of our algorithms, so don't discard the depth
images completely! More generally, we will find that each of the different geometry
representations have strengths and weaknesses -- it is very common to keep multiple
representations around and to convert back and forth between them.
4.3 POINT CLOUD REGISTRATION WITH KNOWN CORRESPONDENCES
Let us begin to address the primary objective of the chapter -- we have a known
object somewhere in the robot's workspace, we've obtained a point cloud from our
depth cameras. How do we estimate the pose of the object, X O ?
Our second point cloud will be the scene points, si , that we obtain from our depth
camera, transformed (via the camera intrinsics) into the camera coordinates, C X si .
I'll use Nm and Ns for the number of model and scene points, respectively.
Let us assume, for now, that we also know X C . Is this reasonable? In the case of
a wrist-mounted camera, this would amount to solving the forward kinematics of
the arm. In an environment-mounted camera, this is about knowing the pose of the
cameras in the world. Small errors in estimating those poses, though, can lead to
large artifacts in the estimated point clouds. We therefore take great care to perform
camera extrinsics calibration; I've linked to our calibration code in the appendix.
Note that if you have a mobile manipulation platform (an arm mounted on a moving
base), then all of these discussions still apply, but you likely will perform all of your
registration in the robot's frame instead of the world frame.
The second major assumption that we will make in this section is "known
correspondences". This is not a reasonable assumption in practice, but it helps us
get started. The assumption is that we know which scene point corresponds with
which model point. I'll use a correspondence vector c ∈ [1, Nm ]Ns , where ci = j denotes
that scene point si corresponds with model point mj . Note that we do not assume
that these correspondences are one-to-one. Every scene point corresponds to some
model point, but the converse is not true, and multiple scene points may correspond
to the same model point.
leaving a single unknown transform, X O , for which we must solve. In the previous
chapter I argued for using differential kinematics instead of inverse kinematics; why
is my story different here? Differential kinematics can still be useful for perception,
for example if we want to track the pose of an object after we acquire its initial pose.
But unlike the robot case, where we can read the joint sensors to get our current
state, in perception we need to solve this harder problem at least once.
What does a solution for X O look like? Each model point gives us three constraints,
when psi ∈ R3 . The exact number of unknowns depends on the particular
representation we choose for the pose, but almost always this problem is
dramatically over-constrained. Treating each point as hard constraints on the
relative pose would also be very susceptible to measurement noise. As a result, it will
serve us better to try to find a pose that describes the data, e.g., in a least-squares
sense:
$$\min_{X^O \in SE(3)} \sum_{i=1}^{N_s} \| X^O \, {}^O p^{m_{c_i}} - X^C \, {}^C p^{s_i} \|^2.$$
Here I have used the notation SE(3) for the "special Euclidean group," which is the
standard notation saying the X O must be a valid rigid transform.
To proceed, let's pick a particular representation for the pose to work with. I will use
3x3 rotation matrices here; the approach I'll describe below also has an equivalent
in quaternion form[2]. To write the optimization above using the coefficients of the
rotation matrix and the translation as decision variables, I have:
$$\min_{p^O, R^O} \sum_{i=1}^{N_s} \| p^O + R^O \, {}^O p^{m_{c_i}} - X^C \, {}^C p^{s_i} \|^2, \qquad (1)$$
$$\text{subject to} \quad R^T = R^{-1}, \quad \det(R) = +1.$$
The constraints are needed because not every 3x3 matrix is a valid rotation matrix;
satisfying these constraints ensures that R is a valid rotation matrix. In fact, the
constraint RT = R−1 is almost enough by itself; it implies that det(R) = ±1. But a
matrix that satisfies that constraint with det(R) = −1 is called an "improper rotation",
or a rotation plus a reflection across the axis of rotation.
Let's think about the type of optimization problem we've formulated so far. Given our decision to use rotation matrices, the term $p^O + R^O\,{}^Op^{m_{c_i}} - X^C\,{}^Cp^{s_i}$ is linear in the decision variables (think $Ax \approx b$), making the objective a convex quadratic. How about the constraints? The first constraint is a quadratic equality constraint (to see it, rewrite it as $RR^T = I$, which gives 9 constraints, each quadratic in the elements of R). The determinant constraint is cubic in the elements of R.
In 2D, we can actually linearly parameterize a rotation matrix with just two
variables:
$$R = \begin{bmatrix} a & -b \\ b & a \end{bmatrix}.$$
Is that what you expected? I generated this plot using just two model/scene points,
but adding more will only change the shape of the quadratic, but not its minimum.
And on the interactive version of the plot, I've added a slider so that you control
the parameter, θ, that represents the ground truth rotation described by R. Try it
out!
In the case with no noise in the measurements (e.g. the scene points are exactly
the model points modified by a rotation matrix), then the minimum of the objective
function already satisfies the constraint. But if you add just a little noise to the
points, then this is no longer true, and the constraint starts to play an important
role.
The geometry should be roughly the same in 3D, though clearly in much higher
dimensions. But I hope that the plot makes it perhaps a little less surprising that
this problem has an elegant numerical solution based on the singular value
decomposition (SVD).
If you are curious, these come from taking the gradient of the Lagrangian with
respect to p, and setting it equal to zero:
$$\sum_{i=1}^{N_s} 2\left(p + R\,{}^Op^{m_{c_i}} - p^{s_i}\right) = 0 \quad \Rightarrow \quad p^* = \frac{1}{N_s}\sum_{i=1}^{N_s} p^{s_i} - R^*\left(\frac{1}{N_s}\sum_{i=1}^{N_s} {}^Op^{m_{c_i}}\right),$$
but don't forget the geometric interpretation above. This is just another way to see
that we can substitute the right-hand side in for p in our objective and solve for R by
itself.
$$R^* = UDV^T, \qquad p^* = \bar{p}^s - R^*\,{}^O\bar{p}^m,$$
where D is the diagonal matrix with entries $[1, 1, \det(UV^T)]$ [3]. This may seem like magic, but replacing the singular values in the SVD with ones gives the optimal projection back onto the orthonormal matrices, and the diagonal matrix ensures that we do not get an improper rotation. There are many derivations available in the literature, but many of them drop the determinant constraint instead of handling the improper rotation. See [4] (section 3) for one of my early favorites.
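Here is a compact numpy sketch of that solution for known correspondences (a generic implementation of the standard SVD procedure, not the notebook's exact code):

import numpy as np

def register_known_correspondences(p_Om, p_s):
    # p_Om: 3 x N model points (object frame), p_s: 3 x N corresponding scene
    # points (world frame). Returns R, p minimizing sum_i ||p + R m_i - s_i||^2.
    m_bar = p_Om.mean(axis=1, keepdims=True)
    s_bar = p_s.mean(axis=1, keepdims=True)
    W = (p_s - s_bar) @ (p_Om - m_bar).T
    U, _, Vt = np.linalg.svd(W)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # guard against reflections
    R = U @ D @ Vt
    p = s_bar.flatten() - R @ m_bar.flatten()
    return R, p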
In the example code, I've made a random object (based on a random set of points
in the x-y plane), and perturbed it by a random 2D transform. The blue points are
the model points in the model coordinates, and the red points are the scene points.
The green dashed lines represent the (perfect) correspondences. On the right, I've
plotted both points again, but this time using the estimated pose to put them both
in the world frame. As expected, the algorithm perfectly recovers the ground truth
transformation.
Open in Colab
What is important to understand is that once the correspondences are given we have
an efficient and robust numerical solution to estimating the pose.
4.4 ITERATIVE CLOSEST POINT (ICP)
So how do we get the correspondences? In fact, if we were given the pose of the
object, then figuring out the correspondences is actually easy: the corresponding
point on the model is just the nearest neighbor / closest point to the scene point
when they are transformed into a common frame.
This observation suggests a natural iterative scheme, where we start with some
initial guess of the object pose and compute the correspondences via closest points,
then use those correspondences to update the estimated pose. This is one of the
famous and often used (despite its well-known limitations) algorithms for point cloud
registration: the iterative closest point (ICP) algorithm.
In words, we want to find the point in the model that is the closest in Euclidean
distance to the transformed scene points. This is the famous "nearest neighbor"
problem, and we have good numerical solutions (using optimized data structures)
for it. For instance, Open3D uses FLANN[5].
Although solving for the pose and the correspondences jointly is very difficult,
ICP leverages the idea that if we solve for them independently, then both parts
have good solutions. Iterative algorithms like this are a common approach taken
in optimization for e.g. bilinear optimization or expectation maximization. It is
important to understand that this is a local solution to a non-convex optimization
problem. So it is subject to getting stuck in local minima.
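Here is a minimal ICP sketch built on the registration routine above (it uses scipy for the nearest-neighbor queries and is for intuition only, not the notebook's implementation):

import numpy as np
from scipy.spatial import cKDTree

def icp(p_Om, p_s, R0, p0, max_iterations=30):
    # p_Om: 3 x Nm model points (object frame), p_s: 3 x Ns scene points (world
    # frame), (R0, p0): initial guess for the object pose.
    R, p = R0, p0
    model_tree = cKDTree(p_Om.T)
    for _ in range(max_iterations):
        # Correspondences: the nearest model point to each scene point, found by
        # pulling the scene points back into the object frame.
        p_Os = R.T @ (p_s - p.reshape(3, 1))
        _, c = model_tree.query(p_Os.T)
        # Pose update with the correspondences held fixed (SVD solution above).
        R, p = register_known_correspondences(p_Om[:, c], p_s)
    return R, p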
Here is ICP running on the random 2D objects. Blue are the model points, red are
the scene points, green dashed lines are the correspondences. I encourage you to
run the code yourself.
I've included one of the animation where it happens to converge to the true
optimal solution. But it gets stuck in a local minima more often than not! I
hope that stepping through will help your intuition. Remember that once the
correspondences are correct, the pose estimation will recover the exact solution.
So every one of those steps before convergence has at least a few bad
correspondences. Look closely!
Open in Colab
Intuition about these local minima has motivated a number of ICP variants, including
point-to-plane ICP, normal ICP, ICP that use color information, feature-based ICP, etc.
The example above is sufficient to demonstrate the problems that ICP can have
with local minima. But we have so far been dealing with unreasonably perfect point
clouds. Point cloud registration is an important topic in many fields, and not all of the
approaches you will find in the literature are well suited to dealing with the messy
point clouds we have to deal with in robotics.
Example 4.5 (ICP with messy point clouds)
I've added a number of parameters for you to play with in the notebook to add
outliers (scene points that do not actually correspond to any model point), to add
noise, and/or to restrict the points to a partial view. Please try them by themselves
and in combination.
Click here for the animation.
Open in Colab
Figure 4.7 - A sample run of ICP with outliers (modeled as additional points
drawn from a uniform distribution over the workspace).
Figure 4.8 - A sample run of ICP with Gaussian noise added to the positions of the
scene points.
Figure 4.9 - A sample run of ICP with scene points limited to a partial view. They
aren't all this bad, but when I saw that one, I couldn't resist using it as the
example!
This leaves us with a "chicken or the egg" problem -- we need a reasonable estimate
of the pose to detect outliers, but for very messy point clouds it may be hard to
get a reasonable estimate in the presence of those outliers. So where do we begin?
One common approach that was traditionally used with ICP is an algorithm called
RANdom SAmple Consensus (RANSAC) [6]. In RANSAC, we select a number of
random (typically quite small) subsets of "seed" points, and compute a pose estimate
for each. The subset with the largest number of "inliers" -- points that are below the
maximum correspondence distance -- is considered the winning set and can be used
to start the ICP algorithm.
There is another important and clever idea to know. Remember the important
observation that we can decouple the solution for rotation and translation by
exploiting the fact that the relative position between points depends on the rotation
but is invariant to translation. We can take that idea one step further and note
that the relative distance between points is actually invariant to both rotation and
translation. This has traditionally been used to separate the estimation of scale from
the estimation of translation and rotation (I've decided to ignore scaling throughout
this chapter, since it doesn't make immediate sense for our proposed problem). In
[7] they proposed that this can also be used to prune outliers even before we have an
initial pose estimate. If the distance between a pair of points in the scene is unlike
any distance between any pair of points in the model, then one of those two points
is an outlier. Finding the maximum set of inliers by this metric would correspond
to finding the maximum clique in a graph. Unfortunately, finding optimal cliques is
probably too expensive for large point clouds, so in [6] conservative approximations
are used instead.
4.5.2 Point cloud segmentation
Not all outliers are due to bad returns from the depth camera, very often they are
due to other objects in the scene. Even in the most basic case, our object of interest
will be resting on a table or in a bin. If we know the geometry of the table/bin
a priori, then we can subtract away points from that region. Alternatively, we can
crop the point cloud to a region of interest. This can be sufficient for very simple or
uncluttered scenes, and will be just enough for us to accomplish our goal for this
chapter.
More generally, if we have multiple objects in the scene, we may still want to find the
object of interest (mustard bottle?). If we had truly robust point cloud registration
algorithms, then one could imagine that the optimal registration would allow us
to correspond the model points with just a subset of the scene points; point cloud
registration could solve the object detection problem! Unfortunately, our algorithms
aren't strong enough.
As a result, we almost always use some other algorithm to "segment" the scene in
a number of possible objects, and run registration independently on each segment.
There are numerous geometric approaches to segmentation[9]. But these days we
tend to use neural networks for segmentation, so I won't go into those details here.
The problem of outliers or multiple objects in the scene challenges our current
approach to correspondences. So far, we've used the notation ci = j to represent the
correspondence of every scene point to a model point. But note the asymmetry in our
algorithm so far (scene ⇒ model). I chose this direction because of the motivation of
partial views from the mustard bottle example from the chapter opening. Once we
start thinking about tables and other objects in the scene, though, maybe we should
have done model ⇒ scene? You should verify for yourself that this would be a minor
change to the ICP algorithm. But what if we have multiple objects and partial views?
It would be simple enough to allow ci to take a special value for "no correspondence",
and indeed that helps. In that case, we would hope that, at the optimum, the scene
points corresponding to the model of interest would be mapped to their respective
model points, and the scene points from outliers and other objects get labelled as "no
correspondence". The model points that are from the occluded parts of the object
would simply not have any correspondences associated with them.
There is an alternative notation that we can use which is slightly more general.
Let's denote the correspondence matrix C ∈ {0, 1}Ns ×Nm , with Cij = 1 iff scene point i
corresponds with model point j. Using this notation, we can write our estimation
step (given known correspondences) as:
$$\min_{X^O \in SE(3)} \sum_{i=1}^{N_s} \sum_{j=1}^{N_m} C_{ij}\, \| X^O\, {}^O p^{m_j} - p^{s_i} \|^2.$$
This subsumes our previous approach: if ci = j then set Cij = 1 and the rest of the
row equal to zero. If the scene point i is an outlier, then set the entire row to zero.
The previous notation was a more compact representation for the true asymmetry
in the "closest point" computation that we did above, and made it clear that we only
needed to compute Ns distances. But this more general notation will take us to the
next round.
What if we relax our strict binary notion of correspondences, and think of them
instead as correspondence weights Cij ∈ [0, 1] instead of Cij ∈ {0, 1}? This is the
main idea behind the "coherent point drift" (CPD) algorithm [10]. If you read the
CPD literature you will be immersed in a probabilistic exposition using a Gaussian
Mixture Model (GMM) as the representation. The probabilistic interpretation is
useful, but don't let it confuse you; trust me that it's the same thing. To maximize
the likelihood on a Gaussian, step one is almost always to take the log because it is a
monotonic function, which gets you right back to our quadratic objective.
The probabilistic interpretation does give a natural mechanism for setting the
correspondence weights on each iteration of the algorithm, by thinking of them as
the probability of the scene points given the model points:
$$C_{ij} = \frac{1}{a_i} \exp\left( -\frac{1}{2} \frac{\| X^O\, {}^O p^{m_j} - p^{s_i} \|^2}{\sigma^2} \right), \qquad (4)$$
which is the standard density function for a Gaussian, where σ is the standard deviation and $a_i$ is the proper normalization constant to make the probabilities sum to one. It is also natural to encode the probability of an outlier in this formulation (it simply modifies the normalization constant). Click to see the normalization constant, but it's really an implementation detail. The normalization works out to be
$$a_i = \sum_{j=1}^{N_m} \exp\left( -\frac{1}{2} \frac{\| X^O\, {}^O p^{m_j} - p^{s_i} \|^2}{\sigma^2} \right) + (2\pi\sigma^2)^{\frac{3}{2}} \frac{w}{1-w} \frac{N_m}{N_s}, \qquad (5)$$
where w is the probability of an outlier.
The CPD algorithm is now very similar to ICP, alternating between assigning the
correspondence weights and updating the pose estimate. The pose estimation step
is almost identical to the procedure above, with the "central" points now
$${}^O\bar{p}^m = \frac{1}{N_C}\sum_{i,j} C_{ij}\, {}^O p^{m_j}, \qquad \bar{p}^s = \frac{1}{N_C}\sum_{i,j} C_{ij}\, p^{s_i}, \qquad N_C = \sum_{i,j} C_{ij},$$
$$W = \sum_{i,j} C_{ij}\, (p^{s_i} - \bar{p}^s)({}^O p^{m_j} - {}^O\bar{p}^m)^T.$$
The rest of the updates, for extracting R using SVD and solving for p given R, are the
same. You won't see the sums explicitly in the code, because each of those steps has
a particularly nice matrix form if we collect the points into a matrix with one point
per column.
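As a sketch of one iteration of those updates in matrix form (a simplified CPD-style step with a fixed σ and outlier weight; the full algorithm also updates σ between iterations):

import numpy as np

def soft_correspondences(p_Om, p_s, R, p, sigma, w=0.1):
    # Returns the Ns x Nm matrix of correspondence weights C_ij, Eqs. (4)-(5).
    p_Wm = R @ p_Om + p.reshape(3, 1)                      # 3 x Nm
    diff = p_s.T[:, None, :] - p_Wm.T[None, :, :]          # Ns x Nm x 3
    K = np.exp(-0.5 * np.sum(diff**2, axis=2) / sigma**2)  # Ns x Nm
    Nm, Ns = p_Om.shape[1], p_s.shape[1]
    outlier = (2 * np.pi * sigma**2) ** 1.5 * w / (1 - w) * Nm / Ns
    return K / (K.sum(axis=1, keepdims=True) + outlier)

def cpd_pose_update(p_Om, p_s, C):
    # Weighted version of the SVD pose solution, using the weights C.
    N_C = C.sum()
    m_bar = (p_Om @ C.sum(axis=0)) / N_C      # weighted model centroid
    s_bar = (p_s @ C.sum(axis=1)) / N_C       # weighted scene centroid
    W = (p_s - s_bar.reshape(3, 1)) @ C @ (p_Om - m_bar.reshape(3, 1)).T
    U, _, Vt = np.linalg.svd(W)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    R = U @ D @ Vt
    p = s_bar - R @ m_bar
    return R, p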
The word on the street is that CPD is considerably more robust than ICP with its hard correspondences and quadratic objective; the Gaussian model mitigates the effect of
outliers by setting their correspondence weight to nearly zero. But it is also more
expensive to compute all of the pairwise distances for large point clouds. In [11],
we point out that a small reformulation (thinking of the scene points as inducing a
distribution over the model points, instead of vice versa) can get us back to summing
over the scene points only. It enjoys many of the robustness benefits of CPD, but also
the performance of algorithms like ICP.
All of the algorithms we've discussed so far have exploited the SVD solution to
the pose estimate given correspondences, and alternate between estimating the
correspondences and estimating the pose. There is another important class of
algorithms that attempt to solve for both simultaneously. This makes the
optimization problem nonconvex, which suggests they will still suffer from local
minima like we saw in the iterative algorithms. But many authors argue that the
solution times using nonlinear solvers can be on par with the iterative algorithms
(e.g. [12]).
Figure 4.11 - A variety of objective functions for robust registration. Click here for
the interactive version.
Another major advantage of using nonlinear optimization is that these solvers can
accommodate a wide variety of objective functions to achieve the same sort of
robustness to outliers that we saw from CPD. I've plotted a few of the popular
objective functions above. The primary characteristic of these functions is that they
taper, so that larger distances eventually do not result in higher cost. Importantly,
like in CPD, this means that we can consider all pairwise distances in our objective,
because if outliers add a constant value (e.g. 1) to the cost, then they have no
gradient with respect to the decision variables and no impact on the optimal
solution. We can therefore write, for our favorite loss function, $l(x)$, an objective of the form, e.g.
$$\min_{X^O} \sum_{i=1}^{N_s} \sum_{j=1}^{N_m} l\left( \| X^O\, {}^O p^{m_j} - p^{s_i} \| \right).$$
And what are the decision variables? We can also exploit the additional flexibility of
the solvers to use minimal parameterizations -- e.g. θ for rotations in 2D, or Euler
angles for rotations in 3D. The objective function is more complicated but we can
get rid of the constraints.
And in the case of the circle, every other model point contributes the same cost. This is not a convex function, but every minimum is a globally optimal solution (just wrapped around by 2π).
In fact, even if we have a more complicated, non-circular, geometry, then this same argument holds. Every point will incur an error as it rotates around the circle, but may have a different radius. The error for all of the model points will decrease as $\cos\theta_{err}$ goes to one. So every minimum is a globally optimal solution. We haven't actually introduced local minima (yet)!
It is the correspondences, and other constraints that we might add to the problem,
that really introduce local minima.
There is an exceptionally nice trick for speeding up the computation in these
nonlinear optimizations, but it does require us to go back to thinking about the
minimum distance from a scene point to the model, as opposed to the pairwise
distance between points. This, I think, is not a big sacrifice now that we have stopped enumerating correspondences explicitly. Let's slightly modify our objective above to be of the form
$$\min_{X^O} \sum_{i=1}^{N_s} l\left( \min_j \| X^O\, {}^O p^{m_j} - p^{s_i} \| \right).$$
In words, we can apply an arbitrary robust loss function, but we will only apply it to the minimum distance between the scene point and the model.
The nested min functions look a little intimidating from a computational perspective.
But this form, coupled with the fact that in our application the model points are
fixed, actually enables a fantastic optimization. First, let's move our optimization into
the model frame by changing the inner term to
Now realize that for any 3D point x, we can precompute the minimum distance
from x to any point on our model, call it ϕ(x). The term above is just ϕ(O X W psi ).
This function of 3D space, sometimes called a distance field, and the closely related
signed distance function (SDF) and level sets are a common representation in
geometric modeling. And they are precisely what we need to quickly compute the
cost from many scene points.
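A small sketch of that idea, approximating the precomputed distance field φ with a KD-tree over the model points (a truly precomputed grid or SDF would avoid even the tree query; the truncated loss here is just one choice of robust loss):

import numpy as np
from scipy.spatial import cKDTree

# "Precompute" the distance-to-model function once; only the pose changes
# during the optimization, never the model.
model_tree = cKDTree(p_Om.T)   # p_Om: 3 x Nm model points, object frame

def robust_cost(R, p, p_s, loss=lambda d: np.minimum(d**2, 0.05**2)):
    # Sum of a truncated least-squares loss over the scene points p_s (3 x Ns),
    # evaluated by pulling the scene into the object frame and querying phi.
    p_Os = R.T @ (p_s - p.reshape(3, 1))
    phi, _ = model_tree.query(p_Os.T)   # minimum distance to the model
    return loss(phi).sum()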
Figure 4.12 - Contour plot of the distance function for our point-based
representation of the rectangle (left), and the signed distance function for our
"mesh-based" 2D rectangle. The smoother distance contours in the mesh-based
representation can help alleviate some of the local minima problems we observed
above.
As the figure I've included above suggests, we can use the precomputed distance
functions to represent the minimum distance to a mesh model instead of just a
point cloud. And with some care, we can define alternative distance-like functions
in 3D space that have smooth subgradients which can work better with nonlinear
optimization. There is a rich literature on this topic; see [14] for a nice reference.
Is there any hope of exploiting the beautiful structure that we found in the pose
estimation with known correspondences with an algorithm that searches for
correspondences and pose simultaneously?
There are some algorithms that claim global optimality for the ICP objective,
typically using a branch and bound strategy [15, 16, 17], but we have been
disappointed with the performance of these on real point clouds. We proposed
another, using branch and bound to explicitly enumerate the correspondences with a
truncated least-squares objective, and a tight relaxation of the SE(3) constraints[18],
but this method is still limited to relatively small problems.
In recent work, TEASER [6] takes a different approach, using the observation
that truncated least squares can be optimized efficiently for the case of scale and
translation. For rotations, they solve a semi-definite relaxation of the truncated least-
squares objective. This relaxation is not guaranteed to be tight, but (like all of the
convex relaxations we describe in this chapter), it is easy to certify after the fact
whether the optimal solution satisfied the original constraints.
4.6 NON-PENETRATION AND "FREE-SPACE" CONSTRAINTS
We've explored two main lineages of algorithms for the pose estimation -- one based
on the beautiful SVD solutions and the other based on nonlinear optimization. As we
will see, non-penetration and free-space constraints are, in most cases, non-convex
constraints, so are a better match for the nonlinear optimization approach. But
there are some examples of convex non-penetration constraints (e.g., when points
must not penetrate a half-plane) and it is possible to include these in our convex
optimization approach. I'll just show it in a simple example.
$$R = \begin{bmatrix} a & -b \\ b & a \end{bmatrix}.$$
Recall that the rotation matrix constraints for this form reduce to $a^2 + b^2 = 1$. These are non-convex constraints, but we can relax them to $a^2 + b^2 \le 1$. This is almost
exactly the problem we visualized in this earlier example, except that we will add
translations here. The relaxation is exactly changing the non-convex unit circle
constraint into the convex unit disk constraint. Based on that visualization, we
have reason to be optimistic that the unit disk relaxation could be tight.
Open in Colab
Please be careful. Now that we have a constraint that depends on p, our original
approach to solving for R independently is no longer valid (at least not without
modification). I've provided a simple implementation in colab.
Figure 4.14 - The convex approximation to the constrained pose estimation
problem with known correspondences. I have added constraints that both the x
and y positions of all of the points in the estimated pose must be greater than 0 in
the world frame (as illustrated by the green lines).
This is a slightly exaggerated case, where the scene points are really pulling the
box into the constraints. We can see that the relaxation is not tight in this situation;
the box is being shrunk slightly to explain the data.
Constrained optimizations like this can be made relatively efficient, and are
suitable for use in an ICP loop for modest-sized point clouds. But the major
limitation is that we can only take this approach for convex non-penetration
constraints (or convex approximations of the constraints), like the "half-plane"
constraints we used here. It is probably not very suitable for non-penetration
between two objects in the scene.
For the record, the fact that I was able to easily add this example here is actually
pretty special. I don't know of another toolbox that brings together the advanced
optimization algorithms with the geometry / physics / dynamical systems in the
way that Drake does.
Figure 4.15 - The rays cast from the camera to the depth returns define a "free
space obstacle" that other objects should not penetrate.
Ultimately, as much as I prefer the convex formulations with their potential for
deeper understanding (by us) and for stronger guarantees (for the algorithms), the
ability to add non-penetration constraints and free-space constraints is simply too
valuable to ignore. I hope that, if you come back to these notes in a year or two, I will
be able to report that we have strong results for these more complex formulations.
But for now, in most applications, I will steer you towards the nonlinear optimization
approaches that can take advantage of these constraints.
Although I very much appreciate the ability to think rigorously about these
geometric approaches, we need to think critically about whether we are solving for
the right objectives.
Even if we stay in the realm of pure geometry, it is not clear that a least-squares
objective with equal weights on the points is correct. Imagine a tall skinny book
lying flat on the table -- we might only get a very small number of returns from the
edges of the book, but those returns contain proportionally much more information
than the slew of returns we might get from the book cover. It is no problem to include
relative weights in the estimation objectives we have formulated here, but we don't
yet have very successful geometry-based algorithms for deciding what those weights
should be in any general way. (There has been a lot of research in this direction, but
it's a hard problem.)
Please realize, though, that as beautiful as geometry is, we are so far all but ignoring
the most important information that we have: the color values! While it is possible
to put color and other features into an ICP-style pipeline, it is very hard to write a
simple distance metric in color space (the raw color values for a single object might
be very different in different lighting conditions, and the color values of different
objects can look very similar). Advances in computer vision, especially those based
on deep learning over the last few years, have brought new life to this problem.
When I asked my students recently "If you had to give up one of the channels, either
depth or color, which would you give up?" the answer was a resounding "I'd give up
depth; don't take away my color!" That's a big change from just a few years ago.
But deep learning and geometry can (should?) work nicely together. For example,
identifying correspondences between point clouds has been a major theme in this
chapter -- and we haven't completely solved it. This is one of the many places where
the color values and deep learning have a lot to offer [20]. We'll turn our attention to
them soon!
4.9 EXERCISES
rotation. If you only keep the best one (lowest estimation error), then how much
does it improve the performance over a single ICP run?
The model points and the scene points are given as:
The true correspondence is given by their numbering, but note that we don't
know the true correspondence - ICP simply determines it based on nearest
neighbors. Given our vanilla-ICP cost (sum of pairwise distances squared), we
can parametrize the 2D pose $X^O$ with $p_x, p_y, a, b$ and write down the resulting
optimization as:
$$\min_{p_x, p_y, a, b} \; \sum_i \left\| \begin{bmatrix} p_x \\ p_y \end{bmatrix} + \begin{bmatrix} a & -b \\ b & a \end{bmatrix} p^{m_i} - p^{s_i} \right\|^2 \quad \text{s.t. } a^2 + b^2 = 1.$$
a. When the initial guess for the pose results in the correct
correspondences based on nearest-neighbors, show that the ICP cost is
minimum when $p_x, p_y, b = 0$ and $a = 1$. Describe the set of initial poses
that results in convergence to the true solution.
b. When the initial guess results in a flipped correspondence, show that
the ICP cost is minimum when $p_x, p_y, b = 0$ and $a = -1$. Describe the set of
initial poses that results in convergence to this incorrect solution.
c. Remember that correspondences need not be one-to-one. In fact, they are
often not when computed based on nearest-neighbors. By constructing
the data matrix W , show that when both scene points correspond to one
model point, ICP gets stuck and does not achieve zero-cost. (HINT: You
may assume that doing SVD on a zero matrix leads to identity matrices
for U and V ).
For this exercise, you will remove the environmental background from the
Stanford bunny pointcloud using the RANSAC algorithm. You will work
exclusively in this notebook. You will be asked to complete the following steps:
[1] Chris Sweeney, Greg Izatt, and Russ Tedrake. A supervised approach to
predicting noise in depth images. In Proceedings of the IEEE International
Conference on Robotics and Automation. IEEE, May 2019. [ www: ]
[2] Berthold K.P. Horn. Closed-form solution of absolute orientation using unit
quaternions. Journal of the Optical Society of America A, 4(4):629--642, April
1987.
[3] Andriy Myronenko and Xubo Song. On the closed-form solution of the rotation
matrix arising in computer vision problems. arXiv preprint arXiv:0904.1613,
2009.
[4] Robert M Haralick, Hyonam Joo, Chung-Nan Lee, Xinhua Zhuang, Vinay G
Vaidya, and Man Bae Kim. Pose estimation from corresponding point data. IEEE
Transactions on Systems, Man, and Cybernetics, 19(6):1426--1446, 1989.
[5] Marius Muja and David G Lowe. Flann, fast library for approximate nearest
neighbors. In International Conference on Computer Vision Theory and
Applications (VISAPP’09), volume 3. INSTICC Press, 2009.
[6] M.A. Fischler and R.C. Bolles. Random sample consensus: A paradigm for
model fitting with applications to image analysis and automated cartography.
In Communications of the Association for Computing Machinery (ACM),
volume 24, pages 381--395, 1981.
[7] Heng Yang, Jingnan Shi, and Luca Carlone. Teaser: Fast and certifiable point
cloud registration. arXiv preprint arXiv:2001.07715, 2020.
[8] Pat Marion, Peter R. Florence, Lucas Manuelli, and Russ Tedrake. A pipeline
for generating ground truth labels for real RGBD data of cluttered scenes.
In International Conference on Robotics and Automation (ICRA), Brisbane,
Australia, May 2018. [ www: ]
[9] Anh Nguyen and Bac Le. 3d point cloud segmentation: A survey. In 2013
6th IEEE conference on robotics, automation and mechatronics (RAM), pages
225--230. IEEE, 2013.
[10] Andriy Myronenko and Xubo Song. Point set registration: Coherent point drift.
IEEE transactions on pattern analysis and machine intelligence,
32(12):2262--2275, 2010.
[11] Wei Gao and Russ Tedrake. Filterreg: Robust and efficient probabilistic point-
set registration using gaussian filter and twist parameterization. In
Proceedings of the Conference on Computer Vision and Pattern Recognition
(CVPR), June 2019. [ www: ]
[12] Andrew W Fitzgibbon. Robust registration of 2d and 3d point sets. Image and
vision computing, 21(13-14):1145--1153, 2003.
[13] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven
Lovegrove. DeepSDF: Learning continuous signed distance functions for shape
representation. In Proceedings of the IEEE International Conference on
Computer Vision and Pattern Recognition (CVPR), June 2019.
[ www: ]
[14] Stanley Osher and Ronald Fedkiw. Level Set Methods and Dynamic Implicit
Surfaces. Applied Mathematical Sciences. Springer, 2003.
[15] Natasha Gelfand, Niloy J Mitra, Leonidas J Guibas, and Helmut Pottmann.
Robust global registration. In Symposium on geometry processing, volume 2,
page 5. Vienna, Austria, 2005.
[16] Nicolas Mellado, Dror Aiger, and Niloy J Mitra. Super 4pcs fast global
pointcloud registration via smart indexing. In Computer Graphics Forum,
volume 33, pages 205--215. Wiley Online Library, 2014.
[17] Jiaolong Yang, Hongdong Li, Dylan Campbell, and Yunde Jia. Go-ICP: A globally
optimal solution to 3d icp point-set registration. IEEE transactions on pattern
analysis and machine intelligence, 38(11):2241--2254, 2015.
[18] Gregory Izatt, Hongkai Dai, and Russ Tedrake. Globally optimal object pose
estimation in point clouds with mixed-integer programming. International
Symposium on Robotics Research, Dec 2017. [ www: ]
[19] Tanner Schmidt, Richard Newcombe, and Dieter Fox. Dart: dense articulated
real-time tracking with consumer depth cameras. Autonomous Robots,
39(3):239--258, 2015.
[20] Peter R. Florence*, Lucas Manuelli*, and Russ Tedrake. Dense object nets:
Learning dense visual object descriptors by and for robotic manipulation. In
Conference on Robot Learning (CoRL), October 2018. [ www: ]
CHAPTER 5 Open in Colab
Bin Picking
Our initial study of geometric perception gave us some powerful tools, but also
revealed some major limitations. In the next chapter, we will begin applying
techniques from deep learning to perception. Spoiler alert: those methods are going
to be insatiable in their hunger for data. So before we get there, I'd like to take a
brief excursion into a nice subproblem that might help us feed that need.
In this chapter we'll consider the simplest version of the bin picking problem: the
robot has a bin full of random objects and simply needs to move those objects from
one bin to the other. We'll be agnostic about what those objects are and about where
they end up in the other bin, but we would like our solution to achieve a reasonable
level of performance for a very wide variety of objects. This turns out to be a pretty
convenient way to create a training ground for robot learning -- we can set the robot
up to move objects back and forth between bins all day and night, and intermittently
add and remove objects from the bin to improve diversity. Of course, it is even easier
in simulation!
If our goal is to test a diversity of bin picking situations, then the first task is to figure
out how to generate diverse simulations. How should we populate the bin full of
objects? So far we've set up each simulation by carefully setting the initial positions
(in the Context) for each of the objects, but that approach won't scale.
The first change that you'll see in this chapter is that I will start using Drake's
"Model Directives" for parsing more complex environments from yaml files. That
will help us set up the bins, cameras, and the robot which can be welded in fixed
positions. But generating the distributions over objects and object initial conditions
is more bespoke, and we need an algorithm for that.
In the real world, we would probably just dump the random objects into the bin.
That's a decent strategy for simulation, too. We can roughly expect our simulation
to faithfully implement multibody physics as long as our initial conditions (and time
step) are reasonable; the physics isn't well defined if we initialize the Context with
multiple objects occupying the same physical space. The simplest and most common
way to avoid this is to generate a random number of objects in random poses, with
their vertical positions staggered so that they trivially start out of penetration.
If you look for them, you can find animations of large numbers of falling objects in
the demo reels for most advanced multibody simulators. But we're doing it for a
purpose!
I've set the initial positions for each object in the Context to be uniformly
distributed over the horizontal position, uniformly rotated, and staggered every
0.1m in their initial vertical position. We only have to simulate for a little more than
a second for the bricks to come to rest and give us our intended "initial conditions".
Open in Colab
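For reference, the pose-staggering idea is only a few lines; here is a minimal sketch (assuming `plant`, `plant_context`, and the list of free bodies come from the scene setup in the notebook; the ranges and seed are placeholders):

```python
import numpy as np
from pydrake.all import (RandomGenerator, RigidTransform,
                         UniformlyRandomRotationMatrix)

# Place each free body at a uniformly random horizontal position and
# orientation, with the vertical positions staggered by 0.1m so that no two
# objects start out in penetration.
def set_staggered_initial_poses(plant, plant_context, bodies, seed=42):
    generator = RandomGenerator(seed)
    rng = np.random.default_rng(seed)
    z = 0.2  # start a little above the bin floor
    for body in bodies:
        translation = [rng.uniform(-0.15, 0.15), rng.uniform(-0.2, 0.2), z]
        X_WO = RigidTransform(UniformlyRandomRotationMatrix(generator),
                              translation)
        plant.SetFreeBodyPose(plant_context, body, X_WO)
        z += 0.1  # stagger the next object 0.1m higher
```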
It's not really any different to do this with any random objects -- here is what it looks
like when I run the same code, but replacing the brick with a random draw from a
few objects from the YCB dataset. It somehow amuses me that we can see the central
limit theorem hard at work, even when our objects are slightly ridiculous.
Open in Colab
I had to decide how to visualize the results of this one for you. The mesh and
texture map files for the YCB objects are very large, so downloading many of
them to your browser from Colab felt a bit too painful. If you've decided to run
the notebooks on your local machine, then go ahead and run drake_visualizer
before running this test to see the live simulation (drake visualizer will load the
mesh files directly from your disk, avoiding the download). For Colab, I've decided
to add a camera to the scene and take a picture after simulating for a few
seconds. After all, that's perhaps the data that we're actually looking for.
Please appreciate that bins are a particularly simple case for generating random
scenarios. If you wanted to generate random kitchen environments, for example,
then you wouldn't be as happy with a solution that drops refrigerators, toasters, and
stools from uniformly random i.i.d. poses. In those cases, authoring reasonable
distributions gets much more interesting; we will revisit the topic of generative
scene models later in the notes.
Even in the case of bins, we should try to think critically about whether dropping
objects from a height is really the best solution. Given our discussion in the last
chapter about writing optimizations with non-penetration constraints, I hope you are
already asking yourself: why not use those constraints again here? Let's explore that
idea a bit further.
I won't dive into a full discussion of multibody dynamics nor multibody simulation,
though I do have more notes available here. What is important to understand here
is that the equations of motion of our multibody system are described by differential
equations of the form:
$$M(q)\dot{v} + C(q, v)v = \tau_g(q) + \sum_i J_i^T(q) F^{c_i}.$$
The left side of the equation is just a generalization of "mass times acceleration",
with the mass matrix, $M$, and the Coriolis terms $C$. The right-hand side is the sum
of the (generalized) forces, with $\tau_g(q)$ capturing the terms due to gravity, and $F^{c_i}$
the spatial force due to the $i$th contact. $J_i(q)$ is the $i$th "contact Jacobian" -- it is the
Jacobian that maps from the generalized velocities to the spatial velocity of the $i$th
contact frame.
But we still need to understand where the contact forces come from.
Open in Colab
When I play with the GUI above, I feel almost overwhelmed. First, overwhelmed by
the sheer number of cases that we have to get right in the code; it's unfortunately
extremely common for open-source tools to have bugs in here. But second,
overwhelmed by the sense that we are asking the wrong question. Asking to
summarize the forces between two bodies in deep penetration with a single
Cartesian force applied at a point is fraught with peril. As you move the objects,
you will find many discontinuities; this is a common reason why you sometimes
see rigid body simulations "explode". It might seem natural to try to use multiple
contact points instead of a single contact point to summarize the forces, and some
simulators do, but it is very hard to write an algorithm that only depends on the
current configurations of the geometry which applies forces consistently from one
time step to the next with these approaches.
I currently feel that the only fundamental way to get around these numerical
issues is to start reasoning about the entire volume of penetration. The Drake
developers have recently proposed a version of this that we believe we can make
computationally tractable enough to be viable for real-time simulation [1], and are
working hard to bring this into full service in Drake. You will find many references
to "hydroelastic" contact throughout the code. We hope to turn it on by default soon.
In the meantime, we can make simulations robust by carefully curating the contact
geometry...
Figure 5.4 - The (exaggerated) contact geometry used for robust simulation of
boxes. We add contact "points" (epsilon radius spheres) to each corner, and have a
slightly inset box for the remaining contacts. Here is the interactive version.
Now that we have that ability to populate our bins full of random clutter, we need
to determine where we will grasp. The pipeline that one might expect, and indeed
the pipeline that robotics was somewhat stuck on for decades, is that the first step
should be to write some form of perception system to look into the bin and segment
the point cloud into individual objects. These days that's not impossible, but it's still
hard!
The newer approach that has proven incredibly useful for bin picking is to just look
for graspable areas directly on the (unsegmented) point cloud. Some of the most
visible proponents of this approach are [2, 3, 4]. If you look at those references,
you'll see that they all use learning to somehow decide where in the point cloud to
grasp. And I do think learning has a lot to offer in terms of choosing good grasps
-- it's a nice and natural learning formulation and there is significant information
besides just what appears in the immediate sensor returns that can contribute to
deciding where to grasp. But often you will see that underlying those learning
approaches is an approach to selecting grasps based on geometry alone, if only to
produce the original training data. I also think that the community doesn't regularly
acknowledge just how well the purely geometric approaches still work. I'd like to
discuss those first.
To get a good view into the bin, we're going to set up multiple RGB-D cameras. I've
used three per bin in all of my examples here. And those cameras don't only see the
objects in the bin; they also see the bins, the other cameras, and anything else in the
viewable workspace. So we have a little work to do to merge the point clouds from
multiple cameras into a single point cloud that only includes the area of interest.
First, we can crop the point cloud to discard any points that are from outside the
area of interest (which we'll define as an axis-aligned bounding box immediately
above the known location of the bin).
As we will discuss in some detail below, many of our grasp selection strategies will
benefit from estimating the "normals" of the point cloud (a unit vector that estimates
the normal direction relative to the surface of the underlying geometry). It is actually
better to estimate the normals on the individual point clouds, making use of the
camera location that generated those points, than to estimate the normal after the
point cloud has been merged.
For sensors mounted in the real world, merging point clouds requires high-quality
camera calibration and must deal with the messy depth returns. All of the tools
from the last chapter are relevant, as the tasks of merging the point clouds is
another instance of the point-cloud-registration problem. For the perfect depth
measurements we can get out of simulation, given known camera locations, we can
skip this step and simply concatenate the list of points in the point clouds together.
Finally, the resulting raw point clouds likely include many more points than we
actually need for our grasp planning. One of the standard approaches for down-
sampling the point clouds is using a voxel grid -- regularly sized cubes tiled in 3D.
We then summarize the original point cloud with exactly one point per voxel (see, for
instance Open3D's note on voxelization). Since point clouds typically only occupy a
small percentage of the voxels in 3D space, we use sparse data structures to store
the voxel grid. In noisy point clouds, this voxelization step is also a useful form of
filtering.
Example 5.4 (Mustard bottle point clouds)
Open in Colab
I've produced a scene with three cameras looking at our favorite YCB mustard
bottle. I've taken the individual point clouds (already converted to the world frame
by the DepthImageToPointCloud system), converted them into Open3D's point
cloud format, cropped the point clouds (to get rid of the geometry from the other
cameras), estimated their normals, merged the point clouds, then down-sampled
the point clouds. The order is important!
I've pushed all of the point clouds to meshcat, but with many of them set to not be
visible by default. Use the drop-down menu to turn them on and off, and make sure
you understand basically what is happening on each of the steps. For this one, I
can also give you the meshcat output directly, if you don't want to use Colab.
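Here is a rough Open3D sketch of that pipeline (the crop box, normal-estimation radius, and voxel size are placeholder values, and the notebook's implementation differs in its details):

```python
import numpy as np
import open3d as o3d

# Crop each cloud to the region above the bin, estimate normals using the
# known camera position, merge the clouds, then voxel down-sample.
# `clouds_W` is a list of Nx3 arrays of points in the world frame, and
# `camera_positions_W` the corresponding camera origins.
def process_point_clouds(clouds_W, camera_positions_W, voxel_size=0.005):
    crop = o3d.geometry.AxisAlignedBoundingBox([-0.3, -0.3, 0.0],
                                               [0.3, 0.3, 0.3])
    merged = o3d.geometry.PointCloud()
    for points, camera_W in zip(clouds_W, camera_positions_W):
        cloud = o3d.geometry.PointCloud(
            o3d.utility.Vector3dVector(points)).crop(crop)
        cloud.estimate_normals(
            search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.01,
                                                              max_nn=30))
        # Flip normals to point back towards the camera that saw them.
        cloud.orient_normals_towards_camera_location(camera_W)
        merged += cloud
    return merged.voxel_down_sample(voxel_size)
```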
It's worth going through the basic estimation of the local geometry in a point cloud.
You'll see it has a lot in common with the point cloud registration we studied in the
last chapter. Let's think about the problem of fitting a plane, in a least-squares sense,
to a set of points[5]. We can describe a plane in 3D with a position p and a unit
length normal vector, $n$. The distance between a point $p^i$ and the plane is simply given
by the magnitude of the inner product, $(p^i - p)^T n$. So our least-squares optimization
becomes
$$\min_{p, n} \sum_{i=1}^{N} \left[ (p^i - p)^T n \right]^2, \quad \text{subject to } \|n\| = 1.$$
Taking the gradient of the Lagrangian with respect to p and setting it equal to zero
gives us that
$$p^* = \frac{1}{N} \sum_{i=1}^{N} p^i.$$
Inserting this back into the objective, we can write the problem as
$$\min_{n} \; n^T W n, \quad \text{subject to } \|n\| = 1, \quad \text{where } W = \sum_i (p^i - p^*)(p^i - p^*)^T.$$
Geometrically, this objective is a quadratic bowl centered at the origin, with a
unit-norm constraint. So the optimal solution is given by the (unit-length) eigenvector
corresponding to the smallest eigenvalue of the data matrix, W . And for any optimal
n, the "flipped" normal −n is also optimal. We can pick arbitrarily for now, and then
flip the normals in a post-processing step (to make sure that the normals all point
towards the camera origin).
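In code, this computation is only a few lines; here is a minimal sketch (the notebook below implements a version of this, plus the visualization):

```python
import numpy as np

# Given a query point's nearest neighbors (an Nx3 array), the plane center is
# the mean and the normal is the eigenvector of the data matrix W associated
# with the smallest eigenvalue.  `camera_position` is used only for the flip.
def estimate_normal(neighbors, camera_position):
    p_star = neighbors.mean(axis=0)
    centered = neighbors - p_star
    W = centered.T @ centered
    eigenvalues, eigenvectors = np.linalg.eigh(W)  # ascending eigenvalues
    normal = eigenvectors[:, 0]
    # Flip so the normal points back towards the camera that saw the point.
    if normal.dot(camera_position - p_star) < 0:
        normal = -normal
    return p_star, normal
```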
Open in Colab
I've coded up the basic least-squares surface estimation algorithm, with the query
point in green, the nearest neighbors in blue, and the local least squares
estimation drawn with our RGB⇔XYZ frame graphic. You should definitely slide it
around and see if you can understand how the axes line up with the normal and
local curvature.
You might wonder where you can read more about algorithms of this type. I don't
have a great reference for you. But Radu Rusu was the main author of the point
cloud library[6], and his thesis has a lot of nice summaries of the point cloud
algorithms of 2010[7].
Now that we have processed our point cloud, we have everything we need to start
planning grasps. I'm going to break that discussion down into two steps. In this
section we'll come up with a cost function that scores grasp candidates. In the
following section, we'll discuss some very simple ideas for trying to find grasp
candidates that have a low cost.
There is a massive literature in robotics on "grasp metrics" and "grasp analysis". The
Handbook of Robotics[8] has an entire chapter on grasping devoted to it (which I
recommend)[9]. For a course on manipulation (especially from someone who cares
a lot about dynamics) to spend so little time on this topic is perhaps surprising,
but a great deal of that literature is focused on analyzing point contacts between a
dexterous hand and an object with known geometry. That's simply not the situation
we find ourselves in for a lot of manipulation work today. We have a simple
two-fingered gripper with a relatively large grasping force (> 80 N) and only local
point clouds to represent our geometry.
But there is one key idea that I do want to extract from that literature: the value
of antipodal grasps. Roughly speaking, once we pick up an object -- or whatever
happens to be between our fingers when we squeeze -- then we will expect the
contact forces between our fingers to have to resist at least the gravitational wrench
(the spatial force due to gravity) of the object. The closing force provided by our
gripper is in the gripper's x-axis, but if we want to be able to pick up the object
without it slipping from our hands, then we need forces inside the friction cones of
our contacts to be able to resist the gravitational wrench. Since we don't know what
that wrench will be (and are somewhat constrained by the geometry of our fingers
from tilting our contact normals into the vertical direction), a reasonable strategy
is to look for places on the point cloud where the normals will align with the x-axis of
the gripper. That means finding a pair of points on the point cloud with normals
that are collinear with the line connecting the points and are pointing in opposite
directions; these are antipodal points. In a real point cloud, we are unlikely to find
perfect antipodal pairs, but finding areas with normals pointing in nearly opposite
directions is a good strategy for grasping!
where $n^i_{G_x}$ is the $x$ component of the normal of the $i$th point in the cropped point
cloud, expressed in the gripper frame.
There are other considerations for what might make a good grasp, too. For our
kinematically limited robot reaching into a bin, we might favor grasps that put the
hand in a favorable orientation for the arm. In the grasp metric I've implemented
in the code, I added a cost for the hand deviating from vertical. I can reward the
dot product of the world $-z$ vector, $[0, 0, -1]$, with the gripper-frame $y$-axis rotated
into the world frame:
$$\text{cost} \mathrel{+}= -\alpha \begin{bmatrix} 0 & 0 & -1 \end{bmatrix} R^G \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} = \alpha R^G_{3,2}.$$
Finally, we need to consider collisions between the candidate grasp and both the
bins and with the point cloud. I simply return infinite cost when the gripper is
in collision. I've implemented all of those terms in the notebook, and given you
sliders to move the hand around and see how the cost changes.
Open in Colab
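To make the pieces of that cost concrete, here is a rough sketch, definitely not the notebook's exact implementation: the crop box, the $\alpha$ weight, and the `in_collision` helper are all placeholders.

```python
import numpy as np

# Score a candidate gripper pose X_WG (a Drake RigidTransform) against the
# merged point cloud.  `cloud_points_W` and `cloud_normals_W` are Nx3 arrays
# in the world frame; `in_collision` checks the gripper model against the
# bins and the point cloud.
def grasp_cost(X_WG, cloud_points_W, cloud_normals_W, in_collision,
               alpha=20.0):
    if in_collision(X_WG):
        return np.inf
    R_WG = X_WG.rotation().matrix()
    p_WG = X_WG.translation()
    # Express the cloud in the gripper frame (row-wise: R_GW @ (p_WC - p_WG)).
    p_GC = (cloud_points_W - p_WG) @ R_WG
    n_GC = cloud_normals_W @ R_WG
    # Keep only the points in a (hypothetical) box swept by the fingers.
    between_fingers = (np.abs(p_GC[:, 0]) < 0.055) & \
                      (np.abs(p_GC[:, 1]) < 0.02) & \
                      (np.abs(p_GC[:, 2]) < 0.02)
    # Antipodal term: reward normals aligned with the gripper x-axis.
    cost = -np.sum(n_GC[between_fingers, 0] ** 2)
    # Penalize the hand deviating from vertical (the equation above).
    cost += alpha * R_WG[2, 1]
    return cost
```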
We've defined a cost function that, given a point cloud from the scene and a model
of the environment (e.g. the location of the bins), can score a candidate grasp pose,
$X^G$. So now we would like to solve the optimization problem: find $X^G$ that minimizes
the cost subject to the collision constraints.
One heuristic, used for instance in [2], is to use the local curvature of the point
cloud to propose grasp candidates that have the point cloud normals pointing into
the palm, and to orient the hand so that the fingers are aligned with the direction
of maximum curvature. One can move the hand in the direction of the normal until
the fingers are out of collision, and even sample nearby points. We have written an
exercise for you to explore this heuristic. But for our YCB objects, I'm not sure it's
the best heuristic; we have a lot of boxes, and boxes don't have a lot of information
to give in their local curvature.
Another heuristic is to find antipodal point pairs in the point cloud, and then
sample grasp candidates that would align the fingers with those antipodal pairs.
Many 3D geometry libraries support "ray casting" operations at least for a voxel
representation of a point cloud; so a reasonable approach to finding antipodal pairs
is to simply choose a point at random in the point cloud, then ray cast into the point
cloud in the opposite direction of the normal. If the normal of the voxel found by ray
casting is sufficiently antipodal, and if the distance to that voxel is smaller than the
gripper width, then we've found a reasonable antipodal point pair.
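Here is a simple sketch of that idea, substituting a nearest-neighbor query for the voxel ray cast; the step size, distance tolerance, and normal-alignment threshold are placeholder values.

```python
import numpy as np
from scipy.spatial import cKDTree

# Sample an antipodal pair: pick a random point, step along the negative
# normal, and look for a nearby point whose normal is nearly opposite.
def sample_antipodal_pair(points, normals, max_width=0.08, rng=None):
    rng = rng or np.random.default_rng()
    tree = cKDTree(points)
    i = rng.integers(len(points))
    p, n = points[i], normals[i]
    for d in np.arange(0.005, max_width, 0.005):
        j = tree.query(p - d * n)[1]  # nearest point to the "ray" sample
        if np.linalg.norm(points[j] - (p - d * n)) > 0.005:
            continue
        if n.dot(normals[j]) < -0.9:  # nearly antipodal normals
            return i, j
    return None
```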
I do implement one more heuristic -- once I've found the points in the point cloud
that are between the finger tips, then I move the hand out along the gripper $x$-axis
so that those points are centered in the gripper frame. This helps prevent us
knocking over objects as we close our hands to grasp them.
But that's it! It's a very simple strategy. I sample a handful of candidate grasps and
just draw the top few with the lowest cost. If you run it a bunch of times, I think
you will find it's actually quite reasonable. Every time it runs, it is simulating the
objects falling from the sky; the grasp evaluation itself is actually quite fast.
Open in Colab
If you play around with the grasp scoring I've implemented above a little bit, you will
find deficiencies. Some of them are addressed easily (albeit heuristically) by adding
a few more terms to the cost. For instance, I didn't check collisions of the pre-grasp
configuration, but this could be added easily.
There are other cases where grasping alone is not sufficient as a strategy. Imagine
that you place an object right in one of the corners of the bin. It might not be possible
to get the hand around both sides of the object without being in collision with either
the object or the side. The strategy above will never choose to try to grab the very
corner of a box (because it always tries to align the sample point normal with the
gripper x), and it's not clear that it should. This is probably especially true for our
relatively large gripper. In the setup we used at TRI, we implemented an additional
simple "pushing" heuristic that would be used if there were point clouds in the sink,
but no viable grasp candidate could be found. Instead of grasping, we would drive
the hand down and nudge that part of the point cloud towards the middle of the bin.
This can actually help a lot!
There are other deficiencies to our simple approach that would be very hard to
address with a purely geometric approach. Most of them come down to the fact that
our system so far has no notion of "objects". For instance, it's not uncommon to see
this strategy result in "double picks" if two objects are close together in the bin. For
heavy objects, it might be important to pick up the object close to the center of mass,
to improve our chances of resisting the gravitational wrench while staying in our
friction cones. But our strategy here might pick up a heavy hammer by squeezing
just the very far end of the handle.
Interestingly, I don't think the problem is necessarily that the point cloud
information is insufficient, even without the color information. I could show you
similar point clouds and you wouldn't make the same mistake. These are the types
of examples where learning a grasp metric could potentially help. We don't need to
achieve artificial general intelligence to solve this one; just the experience of knowing
that when we tried to grasp someplace before and failed would be enough to improve
our grasp heuristics significantly.
5.5 EXERCISES
REFERENCES
[1] Ryan Elandt, Evan Drumwright, Michael Sherman, and Andy Ruina. A pressure
field model for fast, robust approximation of net contact force and moment
between nominally rigid objects. pages 8238--8245, November 2019.
[2] Andreas ten Pas, Marcus Gualtieri, Kate Saenko, and Robert Platt. Grasp pose
detection in point clouds. The International Journal of Robotics Research,
36(13-14):1455--1473, 2017.
[3] Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu
Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-net 2.0: Deep learning to plan
robust grasps with synthetic point clouds and analytic grasp metrics. arXiv
preprint arXiv:1703.09312, 2017.
[4] Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois R Hogan, Maria
Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, et al. Robotic pick-
and-place of novel objects in clutter with multi-affordance grasping and cross-
domain image matching. In 2018 IEEE international conference on robotics and
automation (ICRA), pages 1--8. IEEE, 2018.
[5] Craig M Shakarji. Least-squares fitting algorithms of the nist algorithm testing
system. Journal of research of the National Institute of Standards and
Technology, 103(6):633, 1998.
[6] Radu Bogdan Rusu and Steve Cousins. 3D is here: Point Cloud Library (PCL). In
Proceedings of the IEEE International Conference on Robotics and Automation
(ICRA), Shanghai, China, May 9-13 2011.
[7] Radu Bogdan Rusu. Semantic 3D Object Maps for Everyday Manipulation in
Human Living Environments. PhD thesis, Institut für Informatik der Technischen
Universität München, 2010.
[8] Bruno Siciliano and Oussama Khatib. Springer handbook of robotics. Springer,
2016.
[9] Domenico Prattichizzo and Jeffrey C Trinkle. Grasping. In Springer Handbook of
Robotics, pages 671--700. Springer Berlin Heidelberg, 2008.
CHAPTER 6
Object detection and
segmentation
Our study of geometric perception gave us good tools for estimating the pose of
a known object. These algorithms can produce highly accurate estimates, but are
still subject to local minima. When the scenes get more cluttered/complicated, or if
we are dealing with many different object types, they really don't offer an adequate
solution by themselves.
Deep learning has given us data-driven solutions that complement our geometric
approaches beautifully. Finding correlations in massive datasets has proven to be
a fantastic way to provide practical solutions to these more "global" problems like
detecting whether the mustard bottle is even in the scene, segmenting out the
portion of the image / point cloud that is relevant to the object, and even in providing
a rough estimate of the pose that could be refined with a geometric method.
There are many sources of information about deep learning on the internet, and I
have no ambition of replicating nor replacing them here. But this chapter does begin
our exploration of deep perception in manipulation, and I feel that I need to give just
a little context.
... annotations fall into one of two categories: (1) image-level annotation of
a binary label for the presence or absence of an object class in the image,
e.g., "there are cars in this image" but "there are no tigers," and (2) object-
level annotation of a tight bounding box and class label around an object
instance in the image, e.g., "there is a screwdriver centered at position
(20,25) with width of 50 pixels and height of 30 pixels".
Figure 6.1 - A sample annotated image from the COCO dataset, illustrating the
difference between image-level annotations, object-level annotations, and
segmentations at the class/semantic- or instance-level.
Instance segmentation turns out to be a very good match for the perception needs
we have in manipulation. In the last chapter we found ourselves with a bin full
of YCB objects. If we want to pick out only the mustard bottles, and pick them
out one at a time, then we can use a deep network to perform an initial instance-
level segmentation of the scene, and then use our grasping strategies on only the
segmented point cloud. Or if we do need to estimate the pose of an object (e.g.
in order to place it in a desired pose), then segmenting the point cloud can also
dramatically improve the chances of success with our geometric pose estimation
algorithms.
The ImageNet and COCO datasets contain labels for a variety of interesting classes,
including cows, elephants, bears, zebras and giraffes. They have a few classes that
are more relevant to manipulation (e.g., plates, forks, knives, and spoons), but they
don't have a mustard bottle nor a can of potted meat like we have in the YCB dataset.
So what are we to do? Must we produce the same image annotation tools and pay
for people to label thousands of images for us?
One of the most amazing and magical properties of the deep architectures that
have been working so well for instance-level segmentation is that they appear to
have some modularity in them. A network that was pre-trained on a large dataset
like ImageNet or COCO can be fine-tuned with a much smaller amount of
labeled data to a new set of classes that are relevant to our particular application.
In fact, the architectures are often referred to as having a "backbone" and a "head"
-- in order to train a new set of classes, it is often possible to just pop off the
existing head and replace it with a new head for the new labels. A relatively small
amount of training with a relatively small dataset can still achieve surprisingly
robust performance. Moreover, it seems that training initially on the diverse dataset
(ImageNet or COCO) is actually important to learn the robust perceptual
representations that work for a broad class of perception tasks. Incredible!
This is great news! But we still need some amount of labeled data for our objects
of interest. The last few years have seen a number of start-ups based purely on the
business model of helping you get your dataset labeled. But thankfully, this isn't our
only option.
Just as projects like LabelMe helped to streamline the process of providing pixel-
wise annotations for images downloaded from the web, there are a number of tools
that have been developed to streamline the annotation process for robotics. Our
group introduced a tool called LabelFusion, which combines geometric perception
of point clouds with a simple user interface to very rapidly label a large number of
images[4].
Figure 6.2 - A multi-object scene from LabelFusion [4]. (Mouse over for animation)
Tools like LabelFusion can be used to label large numbers of images very quickly
(three clicks from a user produces ground truth labels in many images).
All of this real world data is incredibly valuable. But we have another super powerful
tool at our disposal: simulation! Computer vision researchers have traditionally
been very skeptical of training perception systems on synthetic images, but as
game-engine quality physics-based rendering has become a commodity technology,
roboticists have been using it aggressively to supplement or even replace their real-
world datasets. The annual robotics conferences now feature regular workshops
and/or debates on the topic of "sim2real". It is still not clear whether we can
generate accurate enough art assets (with tuned material properties) and
environment maps / lighting conditions to represent even specific scenes to the
fidelity we need. An even bigger question is whether we can generate a diverse
enough set of data with distributions representative of the real world to train robust
feature detectors in the way that we've managed to train with ImageNet. But for
many serious robotics groups, synthetic data generation pipelines have significantly
augmented or even replaced real-world labeled data.
For the purposes of this chapter, I aim to train an instance-level segmentation system
that will work well on our simulated images. For this use case, there is (almost) no
debate! Leveraging the pretrained backbone from COCO, I will use only synthetic
data for fine tuning.
You may have noticed it already, but the RgbdSensor that we've been using in Drake
actually has a "label image" output port that we haven't used yet.
[System diagram: RgbdSensor, with output ports color_image, depth_image_32f, label_image, and X_WB.]
This output port exists precisely to support the perception training use case we have
here. It outputs an image that is identical to the RGB image, except that every pixel
is "colored" with a unique instance-level identifier.
Figure 6.3 - Pixelwise instance segmentation labels provided by the "label image"
output port from RgbdSensor. I've remapped the colors to be more visually distinct.
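For training, each label image needs to be converted into per-instance binary masks. A minimal sketch (the `background_ids` argument is a placeholder for the labels belonging to the bins, cameras, and other fixed geometry):

```python
import numpy as np

# Turn an HxW integer label image from the "label_image" output port into the
# per-instance binary masks that Mask R-CNN training expects.
def label_image_to_masks(label_image, background_ids=()):
    masks, ids = [], []
    for instance_id in np.unique(label_image):
        if instance_id in background_ids:
            continue
        masks.append(label_image == instance_id)
        ids.append(int(instance_id))
    return np.stack(masks), ids
```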
I've verified that this code can run on Colab, but to make a dataset of 10k
images using this un-optimized process takes about an hour on my big development
desktop. And curating the files is just easier if you run it locally. So I've provided
this one as a python script instead.
python3 manipulation/clutter_maskrcnn_data.py
You can also feel free to skip this step! I've uploaded the 10k images that I
generated here. We'll download that directly in our training notebook.
6.2 OBJECT DETECTION AND SEGMENTATION
There is a lot to know about modern object detection and segmentation pipelines. I'll
stick to the very basics.
For image recognition (see Figure 1), one can imagine training a standard
convolutional network that takes the entire image as an input, and outputs a
probability of the image containing a sheep, a dog, etc. In fact, these architectures
can even work well for semantic segmentation, where the input is an image and the
output is another image; a famous architecture for this is the Fully Convolutional
Network (FCN) [5]. But for object detection and instance segmentation, even the
number of outputs of the network can change. How do we train a network to output
a variable number of detections?
The mainstream approach to this is to first break the input image up into many
(let's say on the order of 1000) overlapping regions that might represent interesting
sub-images. Then we can run our favorite image recognition and/or segmentation
network on each subimage individually, and output a detection for each region
that is scored as having a high probability. In order to output a tight bounding box,
the detection networks are also trained to output a "bounding box refinement" that
selects a subset of the final region for the bounding box. Originally, these region
proposals were done with more traditional image preprocessing algorithms, as in
R-CNN (Regions with CNN Features)[6]. But the "Fast" and "Faster" versions of R-
CNN replaced even this preprocessing with learned "region proposal networks" [7,
8].
For instance segmentation, we will use the very popular Mask R-CNN network which
puts all of these ideas together, using region proposal networks and fully
convolutional networks for the object detection and for the masks [9]. In Mask R-CNN, the
masks are evaluated in parallel from the object detections, and only the masks
corresponding to the most likely detections are actually returned. At the time of this
writing, the latest and most performant implementation of Mask R-CNN is available
in the Detectron2 project from Facebook AI Research. But that version is not quite
as user-friendly and clean as the original version that was released in the PyTorch
torchvision package; we'll stick to the torchvision version for our experiments
here.
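The head swap itself is only a few lines with torchvision; here is the standard fine-tuning recipe (the number of classes is a placeholder for however many YCB objects you include, plus the background class):

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

# Swap the COCO-pretrained "heads" of Mask R-CNN for new ones sized for our
# classes.  num_classes is a placeholder (our objects plus background).
num_classes = 7

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Replace the box-prediction head.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask-prediction head.
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
hidden_layer = 256
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                   hidden_layer, num_classes)
```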
Open in Colab
(Training Notebook)
Training a network this big (it will take about 150MB on disk) is not fast. I strongly
recommend hitting play on the cell immediately after the training cell while you
are watching it train so that the weights are saved and downloaded even if your
Colab session closes. But when you're done, you should have a shiny new network
for instance segmentation of the YCB objects in the bin!
I've provided a second notebook that you can use to load and evaluate the trained
model. If you don't want to wait for your own to train, you can examine the one
that I've trained!
Open in Colab
(Inference Notebook)
Figure 6.4 - Outputs from the Mask R-CNN inference. (Left) Object detections.
(Right) One of the instance masks.
We can use our Mask R-CNN inference in a manipulation pipeline to do selective picking from
the bin...
6.4 EXERCISES
REFERENCES
[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean
Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al.
Imagenet large scale visual recognition challenge. International journal of
computer vision, 115(3):211--252, 2015.
[2] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva
Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects
in context. In European conference on computer vision, pages 740--755.
Springer, 2014.
[3] Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman.
Labelme: a database and web-based tool for image annotation. International
journal of computer vision, 77(1-3):157--173, 2008.
[4] Pat Marion, Peter R. Florence, Lucas Manuelli, and Russ Tedrake. A pipeline
for generating ground truth labels for real RGBD data of cluttered scenes.
In International Conference on Robotics and Automation (ICRA), Brisbane,
Australia, May 2018. [ www: ]
[5] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional
networks for semantic segmentation. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 3431--3440, 2015.
[6] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature
hierarchies for accurate object detection and semantic segmentation. In
Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 580--587, 2014.
[7] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference
on computer vision, pages 1440--1448, 2015.
[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards
real-time object detection with region proposal networks. In Advances in neural
information processing systems, pages 91--99, 2015.
[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN.
In Proceedings of the IEEE international conference on computer vision, pages
2961--2969, 2017.
CHAPTER 7 Open in Colab
Force Control
Over the last few chapters, we've developed a great initial toolkit for perceiving
objects (or piles of objects) in a scene, planning grasps, and moving the arm to
grasp. But there is a lot more to manipulation than grasping!
In fact, the signs have already been there if you were paying close attention. I'm sure
that you noticed image number 9954 in my clutter segmentation dataset. Good old
09954.png. Just in case you happened to miss it, here it is again.
I don't know how many of the other 9999 random scenes that we generated are
going to have this problem, but 9954 makes it nice and clear. How in the world are
we going to pick up that "Cheez-it" cracker box with our two-fingered gripper? (I
know there are a few of you out there thinking about how easy that scene is for your
suction gripper, but suction alone isn't going to solve all of our problems either.)
Figure 7.2 - Just for fun, I asked my daughter to try a similar task with the "two-
fingered gripper" I scrounged from my kitchen. How are we to program something
like that!
As always in our battle against complexity, I want to find a setup that is as simple
as possible (but no simpler)! Here is my proposal for the box-flipping example.
First, I will restrict all motion to a 2D plane; this allows me to chop off two sides
of the bin (for you to easily see inside), but also drops the number of degrees of
freedom we have to consider. In particular, we can avoid the quaternion floating
base coordinates, by adding a PlanarJoint to the box. Instead of using a complete
gripper, let's start even simpler and just use a "point finger". I've visualized it as a
small sphere, and modeled two control inputs as providing force directly on the x
and z coordinates of the point mass.
Figure 7.3 - The simple model: a point finger, a cracker box, and the bin all in 2D.
The green arrows are the contact forces.
Even in this simple model, and throughout the discussions in this chapter, we will
have two dynamic models that we care about. The first is the full dynamic model
used for simulation: it contains the finger, the box, and the bin, and has a total of
5 degrees of freedom (x, z, θ for the box and x, z for the finger). The second is the
robot model used for designing the controllers: this model has just the two degrees
of freedom of the finger, and experiences unmodelled contact forces. By design, the
equations of this second model are particularly simple:
$$m \begin{bmatrix} \ddot{x} \\ \ddot{z} \end{bmatrix} = u + f^c + mg,$$
where $m$ is the mass, $g = [0, -9.81]^T$ is the gravity vector, $u$ is the control input vector,
and $f^c$ is the contact force (from the world to the finger). Notably absent is the
contact Jacobian which normally pre-multiplies the contact forces, $\sum_i J_i^T f^{c_i}$; in the
case of a point finger this Jacobian is always the identity matrix, and there is only one
point at which contact can be made, so there is no meaningful reason to distinguish
between multiple forces.
[System diagram: a ForceControl system with input ports desired_contact_force, robot_state, and (robot_accelerations OR measured_contact_force), and output port robot_actuation.]
Let us assume that the robot is already in contact. What information do we need
to regulate the contact forces? Certainly we need the desired contact force, f cd .
In general, we will need the robot's state (though in the immediate example, the
dynamics of our point mass are not state dependent). But we also need to either (1)
measure the robot accelerations (which we've repeatedly tried to avoid), (2) assume
the robot accelerations are zero, or (3) provide a measurement of the contact force
so that we can regulate it with feedback.
Let's consider the case where the robot is already in contact with the box. Let's
also assume for a moment that the accelerations are (nearly) zero. This is actually
not a horrible assumption for most manipulation tasks, where the movements are
relatively slow. In this case, our equations of motion reduce to
$$f^c = -mg - u,$$
so if we want to regulate the contact force to a desired value, $f^{c_d}$, we can simply
command $u = -mg - f^{c_d}$.
What happens when the robot is not in contact? In this case, we cannot reasonably
ignore the accelerations, and applying the same control results in $m\dot{v} = -f^{c_d}$. That's
not all bad. In fact, it's one of the defining features of force control that makes
it very appealing. When you specify a desired force and don't get it, the result is
accelerating the contact point in the (opposite) direction of the desired force. In
practice, this (locally) tends to bring you into contact when you are not in contact.
Figure 7.4 - This is a plot of the horizontal position of the finger as a function of
time, given different constant desired force commands.
For all strictly positive desired force commands, the finger will accelerate at a
constant rate until it collides with the box (at $x = 0.089$). For small $f^{c_d}$, the box
barely moves. For an intermediate range of $f^{c_d}$, the collision with the box is enough
to start it moving, but friction eventually brings the box and therefore also the
finger to rest. For large values, the finger will keep pushing the box until it runs
into the far wall.
Open in Colab
This is one of the reasons that researchers working on legged robots also like force
control. On a force-capable walking robot, we might mimic position control during
the "swing phase", to get our foot approximately where we are hoping to step. But
then we switch to a force control mode to actually set the foot down on the ground.
This can significantly reduce the requirements for accurate sensing of the terrain.
Here is my proposal for a simple strategy for flipping up the box. Once in contact,
we will use the contact force from the finger to apply a moment around the bottom
left corner of the box to generate the rotation. But we'll add constraints to the forces
that we apply so that the bottom corner does not slip.
Figure 7.5 - Flipping the box. Click here for an animation of the controller we'll
write in the example below. But it's worth running the code, where you can see the
contact force visualization, too!
These conditions are very natural to express in terms of forces. And once again, we
can implement this strategy with very little information about the box. The position
of the finger will evolve naturally as we apply the contact forces. It's harder for
me to imagine how to write an equally robust strategy using a (high-gain) position-
controlled finger; it would likely require many more assumptions about the geometry
(and possibly the mass) of the box.
with
$${}^W\!R^C(\theta) = \begin{bmatrix} -\sin\theta & -\cos\theta \\ \cos\theta & -\sin\theta \end{bmatrix}.$$
To understand this controller, let's break down the dynamics of the task.
Once the box has rotated up off the bin bottom, the only forces on the box are
the gravitational force, the force that the bin applies on the box corner, call it $f^A$,
and the contact force, $f^C$, that the finger applies on the box. By convention, the
contact frames $A$ and $C$ have the $z$ axis aligned with the contact normal (pointing
into the box). Remember the contact geometry we use for the cracker box has
small spheres in the corners; so both of these contacts are "sphere on box", with
the box normals defining the contact normals (e.g. the $x$ axis is aligned with the
wall of the box), and ${}^W\!R^A = I$. The friction cone constraints are:
$$f^A_{W_z} \ge 0, \quad |f^A_{W_x}| \le \mu_A f^A_{W_z}, \qquad f^C_{C_z} \ge 0, \quad |f^C_{C_x}| \le \mu_C f^C_{C_z}.$$
The rotational dynamics of the box are
$$I \ddot{\theta} = {}^A p^{CM}_W \times \begin{bmatrix} 0 \\ -mg \end{bmatrix} + {}^A p^C_W \times f^C_W,$$
where $I$ is the moment of inertia of the box taken around the top corner, and $p^{CM}$
is the position of the center of mass. We don't want our controller to depend on
most of these parameters, but it helps to understand them! If we move slowly (the
velocities and accelerations are nearly zero), then we can write
$$f^A_{W_x} = -f^C_{W_x}, \qquad f^A_{W_z} = mg - f^C_{W_z}.$$
Taken together, the friction cone constraints on the bin imply the following
constraints on our finger contact:
$$|f^C_{W_x}| \le \mu_A \left( mg - f^C_{W_z} \right).$$
Note that the full dynamic derivation is not too much harder; I did it on paper by
writing ${}^{CM}p^A = [-\frac{h}{2}, -\frac{w}{2}]^T$, with the height, $h$, and width, $w$, of the box. Then one
can write the no-slip-at-the-bin constraint as $\phi(q) = {}^{CM}p^A_{W_x} = \text{constant}$, which implies
that $\ddot{\phi} = 0$, which gives the full expression for $f^A_{W_x}$ (as long as we're in the friction
cone).
Now here's a nice approximation. By choosing to make contact near the bottom of
the box (since we know where the bin is), we can approximate
$${}^A p^C_W \times f^C_W \approx -\,{}^A p^C_{C_z}\, f^C_{C_x}.$$
So commanding the finger contact force based on the orientation error of the box
(subject to the friction-cone constraints above) is a good strategy for regulating the
orientation of the box. It does not assume we know the box geometry nor inertial properties.
That's a lot of details, but all to justify a very simple and robust controller.
We have multiple controllers in this example. The first is the low-level force
controller that takes a desired contact force and sends commands to the robot
to attempt to regulate this force. The second is the higher-level controller that is
looking at the orientation of the box and deciding which forces to request from the
low-level controller.
Please also understand that this is not some unique or optimal strategy for box
flipping. I'm simply trying to demonstrate that sometimes controllers which might
be difficult to express otherwise can easily be expressed in terms of forces!
This approach is conceptually very nice. With only knowledge of the parameters of
the robot itself (not the environment), we can write a controller so that when we
push on the end-effector, the end-effector pushes back (using the entire robot) as
if you were pushing on, for instance, a simple spring-mass-damper system. Rather
than attempting to achieve manipulation by moving the end-effector rigidly through
a series of position commands, we can move the set points (and perhaps stiffness)
of a soft virtual spring, and allow this virtual spring to generate our desired contact
forces.
This approach can also have nice advantages for guaranteeing that your robot won't
go unstable even in the face of unmodeled contact interactions. If the robot acts like
a dissipative system and the environment is a dissipative system, then the entire
system will be stable. Arguments of this form can ensure stability for even very
complex systems, building on the rich literature on passivity theory or more generally
Port-Hamiltonian systems[5].
Our simple model with a point finger is ideal for understanding the essence of
indirect force control. The original equations of motion of our system are
$$m \begin{bmatrix} \ddot{x} \\ \ddot{z} \end{bmatrix} = u + f^c + mg.$$
We can write a simple controller to make this system act, instead, like a (passive)
mass-spring-damper system:
$$m \begin{bmatrix} \ddot{x} \\ \ddot{z} \end{bmatrix} = -k_p \begin{bmatrix} x - x_d \\ z - z_d \end{bmatrix} - k_d \begin{bmatrix} \dot{x} \\ \dot{z} \end{bmatrix} + f^c,$$
with the rest position of the spring at $(x_d, z_d)$. The controller that implements this
follows easily:
$$u = -mg - k_p \begin{bmatrix} x - x_d \\ z - z_d \end{bmatrix} - k_d \begin{bmatrix} \dot{x} \\ \dot{z} \end{bmatrix};$$
in the point finger case this has exactly the familiar form of a proportional-derivative
(PD) controller (plus a gravity feed-forward term)!
Technically, if we are just programming the stiffness and damping, as I've written
here, then a controller of this form would commonly be referred to as "stiffness
control", which is a subset of impedance control. We could also change the effective
mass of the system; this would be impedance control in its full glory. My impression,
though, is that changing the effective mass is most often considered not worth the
complexity that comes from the extra sensing and bandwidth requirements.
The literature on indirect force control has a lot of terminology and implementation
details that are important to get right in practice. Your exact implementation will
depend on, for instance, whether you have access to a force sensor and whether you
can command forces/torque directly. The performance can vary significantly based
on the bandwidth of your controller and the quality of your sensors. See e.g. [6] for
a more thorough survey (and also some fun older videos), or [7] for a nice earlier
perspective.
Open in Colab
To help your intuition, I've made the bin and the box slightly transparent, and
added a visualization (in orange) of the virtual finger or gripper that you are
moving with the sliders.
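Here is a minimal sketch of the point-finger stiffness controller described above (the mass and gains are illustrative only):

```python
import numpy as np

# Stiffness control for the point finger, under the stated assumptions
# (dynamics m*a = u + f_c + m*g): gravity compensation plus a virtual
# spring-damper pulling the finger toward p_desired.
m = 1.0
g = np.array([0.0, -9.81])
kp, kd = 100.0, 20.0

def stiffness_control(p, v, p_desired):
    # u = -m g - kp (p - p_d) - kd v makes the closed loop act like a
    # (passive) mass-spring-damper with rest position p_desired.
    return -m * g - kp * (p - p_desired) - kd * v
```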
Let's embrace indirect force control to come up with another approach to flipping up
the box. Flipping the box up in the middle of the bin required very explicit reasoning
about forces in order to stay inside the friction cone in the bottom left corner of
the box. But there is another strategy that doesn't require as precise control of the
forces. Let's push the box into the corner, and then flip it up.
To make this one happen, I'd like to imagine creating a virtual spring -- you can think
of it like a taut rubber band -- that we attach from the finger to a point near the
wall just a little above the top left corner of the box. The box will act like a pendulum,
rotating around the top corner, with the rubber band creating the moment. At some
point the top corner will slip, but the very same rubber band will have the finger
pushing the box down from the top corner to complete the maneuver.
Consider the alternative of writing an estimator and controller that needed to detect
the moment of slip and make a new plan. That is not a road to happiness. By using
only our model of the robot to make the robot act like a different dynamical system
at the contact point, we can accomplish all of that!
Open in Colab
There are a number of applications where we would like to explicitly command force in one direction, but command positions in another. One classic example is wiping or polishing a surface -- you might care about the amount of force you are applying normal to the surface, but use position feedback to follow a trajectory in the directions tangent to the surface. In the simplest case, imagine controlling force in z and position in x for our point finger:
$$u = -mg + \begin{bmatrix} k_p (x_d - x) + k_d (\dot{x}_d - \dot{x}) \\ -f^{C_d}_{W_z} \end{bmatrix}.$$
More generally, expressed in a contact frame $C$ (with ${}^W R^C$ the rotation from $C$ to the world frame $W$):
$$u = -mg + {}^W R^C \begin{bmatrix} k_p (p^{C_d}_{C_x} - p^C_{C_x}) + k_d (v^{C_d}_{C_x} - v^C_{C_x}) \\ -f^{C_d}_{C_z} \end{bmatrix}.$$
By commanding the normal force, you not only have the benefit of controlling
how hard the robot is pushing on the surface, but also gain some robustness to
errors in estimating the surface normal. If a position-controlled robot estimated
the normal of a wall badly, then it might leave the wall entirely in one direction,
and push extremely hard in the other. Having a commanded force in the normal
direction would allow the position of the robot in that direction to become whatever
is necessary to apply the force, and it will follow the wall.
The choice of position control or force control need not be a binary decision. We can
simply apply both the position command and a "feed-forward" force command:
$$u = -mg + {}^W R^C \left[ K_p (p^{C_d}_C - p^C_C) + K_d (v^{C_d}_C - v^C_C) + f^{ff}_C \right].$$
As we'll see, this is quite similar to the interface provided by the iiwa (and many
other torque-controlled robots).
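As a sketch (my own variable names and gains, not from the text), the combined position/force law above might look like this for the planar point finger, with the x-axis of frame C position-controlled and the z-axis carrying only a feed-forward force:

import numpy as np

def hybrid_position_force(p_C, v_C, p_Cd, v_Cd, f_ff_C, R_WC,
                          m=1.0, g=np.array([0.0, -9.81])):
    """Force command combining position feedback along the x-axis of frame C
    with a feed-forward force along its z-axis.  All arguments except R_WC
    are 2-vectors expressed in frame C; R_WC is the 2x2 rotation matrix from
    C to the world frame W."""
    Kp = np.diag([100.0, 0.0])   # zero position gain in the force direction
    Kd = np.diag([20.0, 0.0])
    u_C = Kp @ (p_Cd - p_C) + Kd @ (v_Cd - v_C) + f_ff_C
    return -m * g + R_WC @ u_C

Following the sign convention in the equations above, pressing on a surface along the z-axis of C with desired contact force magnitude $f_d$ corresponds to passing f_ff_C = [0, -f_d].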
One of the classic problems for manipulator force control, inspired by robotic
assembly, is the problem of mating two parts together under tight kinematic
tolerances. Perhaps the cleanest and most famous version of this is the problem
of inserting a (round) peg into a (round) hole. If your peg is square and your hole
is round, I can't help. Interestingly, much of the literature on this topic in the late
1970's and early 1980's came out of MIT and the Draper Laboratory.
Figure 7.7 - (Left) the "peg in hole" insertion problem with kinematic tolerance c.
(Middle) the desired response to a contact force. (Right) the desired response to a
contact moment. Figures reproduced (with some symbols removed) from [8].
For many peg insertion tasks, it's acceptable to add a small chamfer to the part
and/or the top of the hole. This allows small errors in horizontal alignment to be
accounted for with a simple compliance in the horizontal direction. Misalignment in
orientation is even more interesting. Small misalignments can cause the object to
become "jammed", at which point the frictional forces become large and ambiguous;
we really want to avoid this. Again, we can avoid jamming by allowing a rotational
compliance at the contact. I think you can see why this is a great example for force
control!
The key insight here is that the rotational stiffness that we want should result in
rotations around the part/contact, not around the gripper. We can program this
with stiffness control; but for very tight tolerances the sensing and bandwidth
requirements become a real limitation. One of the coolest results in force control
is the realization that a clever physical mechanism can be used to produce a
remote-centered compliance completely mechanically, with infinite bandwidth and
no sensing required [7]!
Using the floating finger/gripper is a good way to understand the main concepts
of force control without the details. But now it's time to actually implement those
strategies using joint commands that we can send to the arm.
Our starting point is understanding that the equations of motion for a fully-actuated
robotic manipulator have a very structured form:
$$M(q)\ddot{q} + C(q, \dot{q})\dot{q} = \tau_g(q) + u + \sum_i J_i^T(q)\, f^{c_i}. \qquad (1)$$
The left side of the equation is just a generalization of "mass times acceleration", with the mass matrix, $M$, and the Coriolis terms $C$. The right-hand side is the sum of the (generalized) forces, with $\tau_g(q)$ capturing the forces due to gravity, $u$ the joint torques supplied by the motors, and $f^{c_i}$ the Cartesian force due to the $i$th contact, where $J_i(q)$ is the $i$th "contact Jacobian". I introduced a version of these briefly when we described multibody dynamics for dropping random objects into the bin, and have more notes available here. For the purposes of the remainder of this chapter, we can assume that the robot is bolted to the table and so does not have any floating joints; I've therefore used $\dot{q}$ and $\ddot{q}$ instead of $v$ and $\dot{v}$.
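If you want to inspect these terms for the simulated iiwa, Drake's MultibodyPlant can evaluate each of them directly. A minimal sketch, assuming you already have a finalized plant and a context holding the current positions and velocities:

import numpy as np
# `plant` is a pydrake MultibodyPlant (e.g. the iiwa welded to the world) and
# `context` holds the current q and qdot.
M = plant.CalcMassMatrixViaInverseDynamics(context)    # M(q)
Cv = plant.CalcBiasTerm(context)                       # C(q, qdot) * qdot
tau_g = plant.CalcGravityGeneralizedForces(context)    # tau_g(q)

# With no contact forces, the joint accelerations satisfy
#   M(q) qddot + C(q, qdot) qdot = tau_g(q) + u:
u = np.zeros(plant.num_velocities())
qddot = np.linalg.solve(M, tau_g + u - Cv)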
In practice, the way that we most often interface with the iiwa is through its "joint-impedance control" mode, which is written up nicely in [10, 11]. For our purposes, we can view this as a stiffness control in joint space:
$$u = -\tau_g(q) + K_p (q_d - q) - K_d \dot{q}.$$
Check yourself: What is the difference between traditional position control with
a PD controller and joint-stiffness control?
The difference in the algebra is quite small. A PD controller would typically have the form
$$u = K_p (q_d - q) - K_d \dot{q}.$$
In other words, stiffness control tries to cancel out the gravity and any other
estimated terms, like friction, in the model. As written, this obviously requires an
estimated model (which we have for iiwa, but don't believe we have for most arms
with large gear-ratio transmissions) and torque control. But this small difference
in the algebra can make a profound difference in practice. The controller is no
longer depending on error to achieve the joint position in steady state. As such
we can turn the gains way down, and in practice have a much more compliant
response while still achieving good tracking when there are no external torques.
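Here is a minimal sketch of the joint-stiffness law, again assuming a pydrake MultibodyPlant `plant` with an up-to-date `context`, and diagonal gain matrices Kp, Kd of my own naming. The only difference from a PD controller is the model-based gravity cancellation:

def joint_stiffness_control(plant, context, q_d, Kp, Kd):
    q = plant.GetPositions(context)
    v = plant.GetVelocities(context)
    # Cancel the gravity term from the manipulator equations (tau_g enters
    # the right-hand side of (1) with a plus sign, so we subtract it here),
    # then add the virtual spring-damper in joint space.
    tau_g = plant.CalcGravityGeneralizedForces(context)
    return -tau_g + Kp @ (q_d - q) - Kd @ v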
A better analogy for the control we were doing with the point finger example is to write a controller so that the robot acts like a simple dynamical system in the world frame. To do that, we have to identify a frame, $C$, on the robot where we want to impose these simple dynamics -- the origin of this frame is typically the expected point of contact. Following our treatment of kinematics and differential kinematics, we'll define the forward kinematics of this frame as $p^C = f^C_{kin}(q)$, with the corresponding Jacobian $J^C(q)$ so that $v^C = J^C(q)\dot{q}$.
One of the big ideas from manipulator control is that we can actually write the dynamics of the robot in the frame $C$, by writing the joint torques in terms of a spatial force command, $F^u$: $u = J^T(q) F^u$. By substituting this into the manipulator equations (1) and assuming that the only external contacts happen at $C$, we can write the dynamics in the frame $C$, with a task-space mass matrix and bias terms playing the roles of $M$ and $C$. Now we can simply apply the controller we used in joint space, this time acting on the position and velocity of the frame $C$. Please don't forget to add some terms to stabilize the null space of your Jacobian if necessary.
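Here is a rough sketch of such a controller for the translational part of frame C, including a simple joint-centering term projected into the null space of the Jacobian. It assumes a pydrake MultibodyPlant `plant` welded to the world (so $\dot{q} = v$) with an up-to-date `context`; the names Kp, Kd, q_nominal and the gains are my own, not from the text.

import numpy as np
from pydrake.all import JacobianWrtVariable

def cartesian_stiffness_control(plant, context, frame_C, p_WC_desired,
                                Kp, Kd, q_nominal, k_null=1.0, d_null=0.1):
    W = plant.world_frame()
    q = plant.GetPositions(context)
    v = plant.GetVelocities(context)

    # Position and translational Jacobian of the origin of frame C, in world.
    p_WC = plant.CalcPointsPositions(context, frame_C, np.zeros((3, 1)),
                                     W).ravel()
    J = plant.CalcJacobianTranslationalVelocity(
        context, JacobianWrtVariable.kV, frame_C, np.zeros((3, 1)), W, W)
    v_WC = J @ v

    # Task-space (translational) force command: a virtual spring-damper at C.
    F_u = Kp @ (p_WC_desired - p_WC) - Kd @ v_WC

    # Gravity compensation, the task-space command mapped through J^T, and a
    # joint-centering spring-damper acting only in the null space of J.
    tau_g = plant.CalcGravityGeneralizedForces(context)
    N = np.eye(plant.num_velocities()) - np.linalg.pinv(J) @ J
    tau_null = k_null * (q_nominal - q) - d_null * v
    return -tau_g + J.T @ F_u + N @ tau_null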
The implementation of the low-level impedance controllers has many details that
are explained nicely in [9, 10]. In particular, the authors go to quite some length
to implement the impedance law in the actuator coordinates rather than the joint
coordinates (remember that they have an elastic transmission in between the two).
I suspect there are many subtle reasons to prefer this, but they go to lengths to
demonstrate that they can make a true passivity argument with this controller.
The iiwa interface offers a Cartesian impedance control mode. If we want high
performance stiffness control in end-effector coordinates, then we should definitely
use it! The iiwa controller runs at a much higher bandwidth (faster update rate)
than the interface we have over the provided network API, and there are many implementation details that they have gone to great lengths to get right. But in practice we do not
use it, because we cannot convert between joint impedance control and Cartesian
impedance control without shutting down the robot. Sigh. In fact we cannot even
change the stiffness gains or the frame $C$ (aka the "end-effector location") without stopping the robot. So we stay in joint impedance mode and command some Cartesian forces through $\tau_{ff}$ if we desire them. (If you are interested in the driver
details, then I would recommend the documentation for the Franka Control Interface
which is much easier to find and read, and is very similar to functionality provided
by the iiwa driver.)
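Concretely (with hypothetical variable names, and following the sign convention used in the equations earlier in this chapter), the feed-forward torques for a desired contact force at frame C are just the force mapped through the transpose of the contact Jacobian:

# J_C: translational Jacobian of frame C in world coordinates (e.g. computed
# as in the sketch above); f_C_desired: the contact force we would like the
# environment to apply on the robot at C, expressed in world coordinates.
tau_ff = J_C.T @ (-f_C_desired)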
You might also notice that the interface we provide to the ManipulationStation
takes a desired joint position for the robot, but not a desired joint velocity. That is
because we cannot actually send a desired joint velocity to the iiwa controller. In
practice, we believe that they are numerically differentiating our position commands
to obtain the desired velocity, which adds a delay of a few control timesteps (and
sometimes non-monotonic behavior). I don't really know why we aren't allowed to
send our own joint velocity commands.
7.4 PUTTING IT ALL TOGETHER
7.5 EXERCISES
Consider a point mass with dynamics
$$m\ddot{x} = u + f^c,$$
where $x$ is the position, $u$ the input force applied on the point mass, and $f^c$ the contact force from the environment. In order to control the position and force from the robot at the same time, the following controller is proposed:
$$u = \underbrace{k_p (x_d - x) - k_d \dot{x}}_{\text{feedback force for position control}} \;\; \underbrace{-\, f^c_d}_{\text{feedforward force}}.$$
We will evaluate the controller using the error
$$e = (x - x_d)^2 + (f^c - f^c_d)^2.$$
a. In free space (i.e. $f^c = 0$), can we achieve zero error for $x_d = 0$ and a non-zero desired force $f^c_d$? Show your answer by writing down the steady-state ($\dot{x} = 0$, $\ddot{x} = 0$) error as a function of the gains and the desired force, thus arguing that both terms of the error do not achieve zero (i.e. we cannot achieve the desired position or force).
b. Assume a rigid-body contact with a wall located at $x_d = 0$, where a desired non-zero force of $f^c_d$ is commanded. By writing down the error at equilibrium, show that we are able to achieve zero error.
(The remaining parts refer to the force-based "flip-up" strategy from earlier in the chapter: the finger pushes on the box at the contact point C while the box pivots about its corner A.)
a. If we do not push hard enough (i.e. $f^C_{C_z}$ is too small), then we will not be able to create enough friction ($f^C_{C_x}$) to counteract gravity. By summing the torques around the pivot point A, derive a lower bound for the required normal force $f^C_{C_z}$, as a function of $m$, $g$, and $\mu_C$. You may assume that the moment arm of gravity is half that of the moment arm on C, and that the box is square.
b. If we push too hard (i.e. $f^C_{C_z}$ is too big), then the pivot point will end up sliding to the left. By writing the force balance along the axes, derive an upper bound for the required normal force $f^C_{C_z}$, as a function of $m$, $g$, and $\mu_A$. (HINT: It will be necessary to use the torque balance equation from before.)
c. By combining the lower and upper bounds from the previous problems, derive a relation between $\mu_A$ and $\mu_C$ that must hold in order for this motion to be possible. If $\mu_A \approx 0.25$ (wood-metal), how high does $\mu_C$ have to be in order for the motion to be feasible? What if $\mu_A = 1$ (concrete-rubber)?
REFERENCES
CHAPTER 8
Motion Planning
There are a few more essential skills that we need in our toolbox. In this chapter, we
will explore some of the powerful methods of kinematic trajectory motion planning.
I'm actually almost proud of making it this far into the notes without covering this
topic yet. Writing a relatively simple script for the pose of the gripper, like we did in
the bin picking chapter, really can solve a lot of interesting problems. But there are
a number of reasons that we might want a more automated solution.
We'll focus on the problem of moving an arm through space in this chapter, because
that is the classical and very important motivation. But I need to cover this now
for another reason, too: it will also soon help us think about programming a more
dexterous hand.
I do need to make one important caveat. Despite having done some work in this
field myself, I actually really dislike the problem formulation of collision-free motion
planning. I think that on the whole, robots are too afraid of bumping into the world
(because things still go wrong when they do). I don't think humans are solving these
complex geometric problems every time we reach... even when we are reaching in
dense clutter (I actually suspect that we are very bad at it). I would much rather
see robots that perform well even with very coarse / approximate plans for moving
through a cluttered environment, that are not afraid to make incidental contacts,
and that can still accomplish the task when they do!
We know that the forward kinematics give us a (nonlinear) mapping from joint angles to e.g. the pose of the gripper: $X^G = f_{kin}(q)$. So, naturally, the problem of inverse kinematics (IK) is about solving for the inverse map, $q = f^{-1}_{kin}(X^G)$. Indeed,
that is perhaps the most common and classical question studied in inverse
kinematics. But I want you to think of inverse kinematics as a much richer topic than
that.
For example, when we were evaluating grasp candidates for bin picking, we had only
a soft preference on the orientation of the hand relative to some antipodal grasp. In
that case, a full 6 DOF pose would have been an overly constrained specification.
And often we have many constraints on the kinematics: some in joint space (like
joint limits) and others in Cartesian space (like non-penetration constraints). So
really, inverse kinematics is about solving for joint angles in a very rich landscape of
objectives and constraints.
Click here to watch the video.
Figure 8.1 - We made extensive use of rich inverse kinematics specifications in our
work on humanoid robots. The video above is an example of the interactive inverse
kinematics interface (here to help us figure out how to fit our big humanoid robot into the little Polaris). Here is another video of the same tool being used for the Valkyrie humanoid, where we do specify an end-effector pose, but we also add
a joint-centering objective and static stability constraints [1, 2].
With its obvious importance in robotics, you probably won't be surprised to hear
that there is an extensive literature on inverse kinematics. But you may be surprised
at how extensive and complete the solutions can get. The forward kinematics, fkin
, is a nonlinear function in general, but it is a very structured one. In fact, with
rare exceptions (like if your robot has a helical joint, aka screw joint), the equations
governing the valid Cartesian positions of our robots are actually polynomial. "But
wait! What about all of those sines and cosines in my kinematic equations?" you say.
The trigonometric terms come when we want to relate joint angles with Cartesian
coordinates. In $\mathbb{R}^3$, for two points, $A$ and $B$, on the same rigid body, the (squared) distance between them, $\|p^A - p^B\|^2$, is a constant. And a joint is just a polynomial
constraint between positions on adjoining bodies, e.g. that they occupy the same
point in Cartesian space. See [3] for an excellent overview.
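For example (my notation, not from the text), for a single planar revolute joint with link length $l$, introducing the indeterminates $s = \sin\theta$ and $c = \cos\theta$ turns the kinematics into polynomial equations:
$$p^B = p^A + l \begin{bmatrix} c \\ s \end{bmatrix}, \qquad s^2 + c^2 = 1.$$
Chaining such substitutions for every joint gives a system of polynomial equations in the $(s_i, c_i)$ variables, which is exactly what the algebraic-geometry machinery operates on.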
The algebraic-geometry methods are mostly targeted at offline global analysis; they are not designed for the fast, real-time inverse kinematics solutions needed in a control loop. The most popular tool these days for real-time inverse kinematics
for six- or seven-DOF manipulators is a tool called "IKFast", described in Section 4.1
of [4], that gained popularity because of its effective open-source implementation.
Rather than focus on completeness, IKFast uses a number of approximations to
provide fast and numerically robust solutions to the "easy" kinematics problems. It
leverages the fact that a six-DOF pose constraint on a six-DOF manipulator has closed-form solutions, with a finite number of joint-space configurations that produce
the same end-effector pose, and for seven-DOF manipulators it adds a layer of
sampling in the last degree of freedom on top of the six-DOF solver.
These closed-form solutions are important to understand because they provide deep
insight into the equations, and because they can be fast enough to use inside a more
sophisticated solution approach. But the closed-form solutions don't provide the rich
specification I advocated for above; in particular, they break down once we have
inequality constraints instead of equality constraints. For those richer specifications,
we will turn to optimization.
8.1.2 IK as constrained optimization
Figure 8.2 - A richer inverse kinematics problem: solve for the joint angles, q, that
allow the robot to reach into the shelf and grab the object, while avoiding
collisions.
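As a sketch of how the shelf-reaching problem in Figure 8.2 might be posed with Drake's InverseKinematics class: the plant, context, gripper frame, numerical bounds, and q_nominal below are all placeholders/assumptions of mine, and you should consult the Drake documentation for the authoritative API.

import numpy as np
from pydrake.all import InverseKinematics, RotationMatrix, Solve

# `plant` is a MultibodyPlant (with the shelf geometry registered in a
# SceneGraph), `plant_context` its context, and `gripper_frame` the frame G.
ik = InverseKinematics(plant, plant_context)
q = ik.q()   # the joint angles are the decision variables

# Keep the gripper origin inside a small box at the target location in world.
ik.AddPositionConstraint(gripper_frame, [0, 0, 0], plant.world_frame(),
                         [0.45, -0.05, 0.25], [0.55, 0.05, 0.35])
# Keep the gripper orientation within ~18 degrees of a nominal rotation.
ik.AddOrientationConstraint(gripper_frame, RotationMatrix(),
                            plant.world_frame(),
                            RotationMatrix.MakeXRotation(np.pi / 2),
                            0.1 * np.pi)
# Stay at least 1 cm away from collisions (robot-shelf and self-collisions).
ik.AddMinimumDistanceConstraint(0.01)

prog = ik.prog()
prog.AddQuadraticErrorCost(np.eye(len(q)), q_nominal, q)  # joint centering
prog.SetInitialGuess(q, q_nominal)
result = Solve(prog)
if result.is_success():
    q_solution = result.GetSolution(q)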
Global IK.
REFERENCES
[1] Maurice Fallon, Scott Kuindersma, Sisir Karumanchi, Matthew Antone, Toby
Schneider, Hongkai Dai, Claudia Pérez D'Arpino, Robin Deits, Matt DiCicco,
Dehann Fourie, Twan Koolen, Pat Marion, Michael Posa, Andrés Valenzuela,
Kuan-Ting Yu, Julie Shah, Karl Iagnemma, Russ Tedrake, and Seth Teller. An
architecture for online affordance-based perception and whole-body planning.
Journal of Field Robotics, 32(2):229--254, September 2014.
[2] Pat Marion, Maurice Fallon, Robin Deits, Andrés Valenzuela, Claudia Pérez
D'Arpino, Greg Izatt, Lucas Manuelli, Matt Antone, Hongkai Dai, Twan Koolen,
John Carter, Scott Kuindersma, and Russ Tedrake. Director: A user interface
designed for robot operation with shared autonomy. Journal of Field Robotics,
1556-4967, 2016.
[3] Charles W. Wampler and Andrew J. Sommese. Numerical algebraic geometry and
algebraic kinematics. Acta Numerica, 20:469--567, 2011.
[4] Rosen Diankov. Automated Construction of Robotic Manipulation Programs. PhD
thesis, Carnegie Mellon University, August 2010.
CHAPTER 9
Reinforcement Learning
This is a placeholder chapter. Content to be filled in. In the meantime, I have some relevant notes here.
9.1 EXERCISES
A.1 PYDRAKE
DRAKE is primarily a C++ library, with rigorous coding standards and a maturity
level intended to support even professional applications in industry. In order to
provide a gentler introduction, and to facilitate rapid prototyping, I've written these
notes exclusively in python, using Drake's python bindings (pydrake). These bindings are less mature than the C++ backend, but they are improving rapidly; your feedback (and even contributions) is very welcome.
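If you want a quick check that the bindings are installed and importable, a trivial sketch (not from the text) is:

from pydrake.all import MultibodyPlant

plant = MultibodyPlant(time_step=0.0)   # an empty continuous-time plant
plant.Finalize()
print(plant.num_positions())            # prints 0, but confirms pydrake works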
There are also a number of tutorials in DRAKE that can help you get started.
I will provide nearly all examples and exercises in the form of a Jupyter Notebook
so that we can leverage the fantastic and relatively recent availability of (free)
cloud resources. In particular I have chosen to focus on using Google's Colaboratory
(Colab) as our primary platform. Most of the chapters have a corresponding
notebook that you can run on the cloud without installing anything on your machine!
As you get more advanced, you will want to run (and extend) these examples on your
own machine. The DRAKE website has a number of installation options, including
precompiled binaries and Docker instances. Here I provide an example workflow of
setting up drake from the precompiled binaries and running the textbook examples.
The links below indicate the specific distributions that the examples in this text have
been tested against.
and install the prerequisites using the platform-specific installation script provided:
cd manipulation
sudo scripts/setup/ubuntu/18.04/install_prereqs.sh
pip3 install --requirement requirements.txt
export PYTHONPATH=`pwd`:${PYTHONPATH}
You'll likely want to start from the manipulation root directory. Then launch your
notebook with:
jupyter notebook
The examples for each chapter that has them will be in a .ipynb file right alongside the chapter's html file, and the notebook exercises are all located in the exercises subdirectory.
If you have trouble with DRAKE, please follow the advice here. If you have trouble
with the manipulation repo, you can check for known issues (and potentially file a
new one) here.
APPENDIX B
Setting up your own
"Manipulation Station"
Simulation is an extremely powerful tool, and the fact that we can give it away
freely makes it a powerful equalizer. Just a few years ago, nobody believed that you
could develop a manipulation system in simulation and expect it to work in reality.
Nowadays, though, we are seeing consistent evidence that our simulation fidelity
is high enough that if we can achieve robust performance in simulation, then we are
optimistic that it can transfer to reality.
As a result, the primary drivers we list/provide below are based on LCM. I'll try
to point to ROS equivalents when possible. If you already have an established ROS
workflow and still want to use our drivers, then it is also possible to use simple
lcm2ros/ros2lcm "bridge" executables to e.g. publish the LCM messages on a ROS
topic (for logging, etc).
We've been quite happy with the Schunk WSG 50 gripper as a reliable and accurate
two-fingered gripper solution.
Note: We are evaluating the Franka Panda as an alternative platform (and are likely
to provide drivers for it here soon).
For the purposes of having a common workstation that we can use in the labs, we
have included here the simple environment in which we mount the robot. The robot
just barely fits in the arena -- it was designed with the dimensions limited so that it
can be moved (via a manual pallet jack) through standard office doors.
Thank you for citing these notes in your work. Please use the following citation:
My primary goal for the annotation tool is to host a completely open dialogue on
the intellectual content of the text. However, it has turned out to serve an additional
purpose: it's a convenient way to point out my miscellaneous typos and grammatical
blips. The only problem is that if you highlight a typo, and I fix it 10 minutes
later, your highlight will persist forevermore. Ultimately this pollutes the annotation
content.
• You can make a public editorial comment, but must promise to delete that comment once it has been addressed.
• You can join my "editorial" group and post your editorial comments using
this group "scope".
Ideally, once I mark a comment as "done", I would appreciate it if you can delete that
comment.
I highly value both the discussions and the corrections. Please keep them coming,
and thank you!
I'm very interested in your feedback. The annotation tool is one mechanism, but you
can also comment directly on the YouTube lectures, or even add issues to the github
repo that hosts these course notes.
I will also post a proper survey questionnaire here once the notes/course has existed
long enough for that to make sense.