Learning Human Activities and Object Affordances From RGB-D Videos
Abstract—Understanding human activities and object affordances are two very important skills, especially for personal
robots which operate in human environments. In this work,
we consider the problem of extracting a descriptive labeling
of the sequence of sub-activities being performed by a human,
and more importantly, of their interactions with the objects in
the form of associated affordances. Given a RGB-D video, we
jointly model the human activities and object affordances as a
Markov Random Field where the nodes represent objects and
sub-activities, and the edges represent the relationships between
object affordances, their relations with sub-activities, and their
evolution over time. We formulate the learning problem using a
structural SVM approach, where labelings over various alternate
temporal segmentations are considered as latent variables. We
tested our method on a challenging dataset comprising 120
activity videos collected from four subjects, and obtained an
end-to-end precision of 75.8% and recall of 74.2% for labeling
the activities. We then demonstrate the use of such descriptive
labeling in performing assistive tasks by a PR2 robot.1
I. INTRODUCTION
It is indispensable for a personal robot to perceive the
environment in order to perform assistive tasks. Recent works
in this area have addressed tasks such as estimating geometry
(Henry et al., 2012), tracking objects (Choi and Christensen,
2012), recognizing objects (Collet et al., 2011), and labeling
geometric classes (Anand et al., 2012). Beyond geometry
and objects, humans are an important part of the indoor
environments. Building upon the recent advances in human
pose detection from an RGB-D sensor (Shotton et al., 2011),
in this paper we present learning algorithms to detect the
human activities and object affordances. This information
can then be used by assistive robots as shown in Fig. 1.
Most prior works in human activity detection have focused on still images or 2D
videos. Estimating the human pose is the primary focus of
these works, and they consider actions taking place over
shorter time scales (see Section II). Having access to a
3D camera, which provides RGB-D videos, enables us to
robustly estimate human poses and use this information for
learning complex human activities.
Our focus in this work is to obtain a descriptive labeling
of the complex human activities that take place over long
time scales and consist of a long sequence of sub-activities,
such as making cereal and arranging objects in a room
(see Fig. 8). For example, the making cereal activity consists of
around 12 sub-activities on average, which include reaching
the pitcher, moving the pitcher to the bowl, and then pouring
the milk into the bowl. This proves to be a very challenging
task given the variability across individuals in performing
each sub-activity, and other environment-induced conditions
such as cluttered backgrounds and viewpoint changes. (See
Fig. 2 for some examples.)
1 A first version of this work was made available on arXiv (Koppula et al.,
2012) for faster dissemination of scientific work.
In most previous works, object detection and activity
recognition have been addressed as separate tasks. Only
recently, some works have shown that modeling mutual
context is beneficial (Gupta et al., 2009; Yao and Fei-Fei,
2010). The key idea in our work is to note that, in activity
detection, it is sometimes more informative to know how
an object is being used (associated affordances, Gibson,
1979) rather than knowing what the object is (i.e., the object
category). For example, both a chair and a sofa might be categorized as sittable, and a cup might be categorized as both
drinkable and pourable. Note that the affordances of an
object change over time depending on its use, e.g., a pitcher
may first be reachable, then movable and finally pourable. In
addition to helping activity recognition, recognizing object
affordances is important by itself because of their use in
robotic applications (e.g., Kormushev et al., 2010).
We propose a method to learn human activities by modeling the sub-activities and affordances of the objects, how
they change over time, and how they relate to each other.
More formally, we define a Markov Random Field over two
kinds of nodes: object and sub-activity nodes. The edges
model the relationships between the object affordances, their
relations with the sub-activities, and their evolution over time.
Fig. 2. Significant Variations, Clutter and Occlusions: Example shots of reaching sub-activity from our dataset. First and third rows show the RGB
images, and the second and bottom rows show the corresponding depth images from the RGB-D camera. Note that there are significant variations in the
way the subjects perform the sub-activity. In addition, there is significant background clutter and subjects are partially occluded (e.g., column 1) or not
facing the camera (e.g., row 1 column 4) in many instances.
We present our learning, inference and temporal segmentation algorithms in Section VIII. We present the experimental
results along with robotic demonstrations in Section IX and
finally conclude the paper in Section X.
II. RELATED WORK
There is a lot of recent work on improving robotic perception in order to enable robots to perform many
useful tasks. These works include 3D modeling of indoor
environments (Henry et al., 2012), semantic labeling of
environments by modeling objects and their relations to other
objects in the scene (Koppula et al., 2011; Lai et al., 2011b;
Anand et al., 2012; Rosman and Ramamoorthy, 2011), developing frameworks for object recognition and pose estimation
for manipulation (Collet et al., 2011), object tracking for
3D object modeling (Krainin et al., 2011), etc. Robots are
now becoming more integrated in human environments and
are being used in assistive tasks such as automatically interpreting and executing cooking recipes (Bollini et al., 2012),
robotic laundry folding (Miller et al., 2011) and arranging
a disorganized house (Jiang et al., 2012). Such applications
make it critical for robots to understand both object
affordances and human activities in order to work
alongside humans. We describe the recent advances in
the various aspects of this problem here.
Object affordances. An important aspect of the human
environment that a robot needs to understand is object affordances. Most of the work within the robotics community
related to affordances has focused on predicting opportunities
for interaction with an object either by using visual cues
(Sun et al., 2009; Hermans et al., 2011; Aldoma et al., 2012)
or through observation of the effects of exploratory behaviors
(Ridge et al., 2009; Moldovan et al., 2012; Montesano et al.,
2008). For instance, Sun et al. (2009) propose a probabilistic
graphical model that leverages visual object categorization
for learning affordances and Hermans et al. (2011) propose
the use of physical and visual attributes as a mid-level representation for affordance prediction. Aldoma et al. (2012)
propose a method to find affordances which depends solely
on the objects of interest and their position and orientation
in the scene. These methods do not consider the object
affordances in human context, i.e., how the objects are usable
by humans. We show that human-actor based affordances
are essential for robots working in human spaces in order
for them to interact with objects in a human-desirable way.
There is some recent work in interpreting human actions
and interaction with objects (Aksoy et al., 2011; Konidaris
et al., 2012) in the context of learning to perform actions from
demonstrations. In contrast to these methods, we propose a
model to learn human activities spanning long durations
and action-dependent affordances, which make robots more
capable of performing assistive tasks, as we later describe in
Section IX-E.
Human activity detection from 2D videos. There has been
a lot of work on human activity detection from images (Yang
et al., 2010; Yao et al., 2011) and from videos (Liu et al.,
[Fig. 3 graphic: nodes carrying object features and sub-activity features, connected by object-object interactions, sub-activity-object interactions, and temporal interactions.]
Fig. 3. Pictorial representation of the different types of nodes and relationships modeled in part of the cleaning objects activity comprising three
sub-activities: reaching, opening and scrubbing. (See Section III.)
III. OVERVIEW
Over the course of a video, a human may interact with
several objects and perform several sub-activities over time.
In this section we describe at a high level how we process
the RGB-D videos and model the various properties for
affordance and activity labeling.
Given the raw data containing the color and depth values
for every pixel in the video, we first track the human skeleton
using OpenNI's skeleton tracker2 to obtain the locations
of the various joints of the human skeleton. However, these
values are not very accurate, as the OpenNI skeleton tracker
is designed to track human skeletons only in clutter-free
environments and without any occlusion of the body parts. In
real-world human activity videos, some body parts are often
occluded and the interaction with the objects hinders accurate
skeleton tracking. We show that even with such noisy data,
our method gets high accuracies by modeling the mutual
context between the affordances and sub-activities.
We then segment the objects being used in the activity and
track them throughout the 3D video, as explained in detail
in Section V. We model the activities and affordances by
defining a Markov Random Field (MRF) over the spatiotemporal sequence we get from an RGB-D video, as illustrated in Fig. 3. If we build our graph with nodes for objects
and sub-activities for each time instant (at 30fps), then we
will end up with quite a large graph. Furthermore, such a
graph would not be able to model meaningful transitions
between the sub-activities because they take place over a
long time (e.g., a few seconds). Therefore, in our approach
we first segment the video into small temporal segments,
and our goal is to label each segment with appropriate
labels. We try to over-segment, so that we end up with
more segments and avoid merging two sub-activities into
one segment. Each of these segments occupies a small length
of time; therefore, considering one set of nodes per segment gives us a compact graph over which learning and inference remain tractable.
2 http://openni.org
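To make the over-segmentation step concrete, the sketch below splits a frame sequence into short fixed-length temporal segments, each of which then carries one sub-activity node and one node per object. The uniform segment length and the helper name are illustrative assumptions, not the exact segmentation criteria used here (Section VIII also considers multiple alternative segmentations).

```python
# Minimal sketch: group an RGB-D video's frames into short temporal segments so
# that each segment (rather than each 30 fps frame) becomes a node in the graph.
# The fixed segment length is an assumption made for illustration only.

def uniform_segments(num_frames, seg_len=20):
    """Return a list of (start_frame, end_frame) index pairs covering the video."""
    segments = []
    start = 0
    while start < num_frames:
        end = min(start + seg_len, num_frames)
        segments.append((start, end))
        start = end
    return segments

if __name__ == "__main__":
    # e.g., a 10-second clip at 30 fps -> 15 segments of about 0.66 s each
    print(uniform_segments(num_frames=300, seg_len=20))
```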
Given the segments, we predict the labeling $\hat{y}$ that maximizes an energy function $E_w(x, y)$ defined over this spatio-temporal graph:

$$\hat{y} = \operatorname*{argmax}_{y} E_w(x, y). \qquad (1)$$

The energy decomposes over the node and edge types as

$$E_w(x, y) = E_o + E_a + E_{o,o} + E_{o,a} + E^t_{o,o} + E^t_{a,a}. \qquad (2)$$

The two node terms are

$$E_o = \sum_{i \in V_o} \psi_o(i) = \sum_{i \in V_o} \sum_{k \in K_o} y_i^k \left[ w_o^k \cdot \phi_o(i) \right], \qquad (3)$$

$$E_a = \sum_{i \in V_a} \psi_a(i) = \sum_{i \in V_a} \sum_{k \in K_a} y_i^k \left[ w_a^k \cdot \phi_a(i) \right], \qquad (4)$$
where $\phi_a(i)$ denotes the feature map describing the sub-activity node $i$ through a vector of features, and there is one weight vector for each sub-activity class in $K_a$.
For each segment $s$, there is an edge connecting every pair of object nodes in $V_o^s$, denoted by $E_{o,o}$, and an edge connecting each object node to the sub-activity node $v_a^s$, denoted by $E_{o,a}$. These edges signify the relationships among the objects, and between the objects and the human pose, within a segment; they are referred to as object-object interactions and sub-activity-object interactions in Fig. 3, respectively.
The sub-activity node of segment $s$ is connected to the sub-activity nodes in segments $(s-1)$ and $(s+1)$. These temporal edges are denoted by $E^t_{a,a}$. Similarly, every object node of segment $s$ is connected to the corresponding object nodes in segments $(s-1)$ and $(s+1)$, denoted by $E^t_{o,o}$. These edges model the temporal interactions between the human poses and between the objects, respectively, and are represented by dotted edges in Fig. 3.
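The following sketch makes this graph structure explicit for a given temporal segmentation; the object identifiers and the simple tuple-based node encoding are assumptions made for illustration.

```python
# Minimal sketch of the spatio-temporal graph: one sub-activity node plus one
# node per object in every temporal segment, with object-object and
# object-sub-activity edges within a segment, and temporal edges between
# corresponding nodes of adjacent segments.

def build_graph(num_segments, object_ids):
    nodes, edges = [], {"oo": [], "oa": [], "oo_t": [], "aa_t": []}
    for s in range(num_segments):
        a = ("activity", s)                            # sub-activity node v_a^s
        objs = [("object", s, o) for o in object_ids]  # object nodes V_o^s
        nodes.append(a)
        nodes.extend(objs)
        for i, oi in enumerate(objs):                  # within-segment edges
            edges["oa"].append((oi, a))
            for oj in objs[i + 1:]:
                edges["oo"].append((oi, oj))
        if s > 0:                                      # temporal edges to segment s-1
            edges["aa_t"].append((("activity", s - 1), a))
            for o in object_ids:
                edges["oo_t"].append((("object", s - 1, o), ("object", s, o)))
    return nodes, edges

nodes, edges = build_graph(num_segments=3, object_ids=["pitcher", "bowl"])
print(len(nodes), {k: len(v) for k, v in edges.items()})
```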
We have one energy term for each of the four interaction types, defined as:
$$E_{o,o} = \sum_{(i,j) \in E_{o,o}} \sum_{(l,k) \in K_o \times K_o} y_i^l y_j^k \left[ w_{o,o}^{lk} \cdot \phi_{o,o}(i,j) \right], \qquad (5)$$

$$E_{o,a} = \sum_{(i,j) \in E_{o,a}} \sum_{(l,k) \in K_o \times K_a} y_i^l y_j^k \left[ w_{o,a}^{lk} \cdot \phi_{o,a}(i,j) \right], \qquad (6)$$

$$E^t_{o,o} = \sum_{(i,j) \in E^t_{o,o}} \sum_{(l,k) \in K_o \times K_o} y_i^l y_j^k \left[ w_{t_{o,o}}^{lk} \cdot \phi_{t_{o,o}}(i,j) \right], \qquad (7)$$

$$E^t_{a,a} = \sum_{(i,j) \in E^t_{a,a}} \sum_{(l,k) \in K_a \times K_a} y_i^l y_j^k \left[ w_{t_{a,a}}^{lk} \cdot \phi_{t_{a,a}}(i,j) \right]. \qquad (8)$$
Writing the four interaction terms generically, with $T$ denoting the set of edge types, $E_t$ the edges of type $t$ and $T_t$ the corresponding label pairs, the energy function is

$$E_w(x, y) = \sum_{i \in V_a} \sum_{k \in K_a} y_i^k \left[ w_a^k \cdot \phi_a(i) \right] + \sum_{i \in V_o} \sum_{k \in K_o} y_i^k \left[ w_o^k \cdot \phi_o(i) \right] + \sum_{t \in T} \sum_{(i,j) \in E_t} \sum_{(l,k) \in T_t} y_i^l y_j^k \left[ w_t^{lk} \cdot \phi_t(i, j) \right]. \qquad (9)$$

Object Tracking: We used the particle filter tracker implementation3 provided under the PCL library for tracking our target object. The tracker uses the color values and the normals to find the next probable state of the object.

3 http://www.willowgarage.com/blog/2012/01/17/tracking-3d-objectspoint-cloud-library
4 In our current implementation, this method needs an initial guess of the 2D bounding boxes of the objects to keep the algorithm tractable. We can obtain this by considering only the tabletop objects, using a tabletop object segmenter (e.g., Rusu et al., 2009). We initialize the graph with these guesses.
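Once the labels are fixed, the energy of Eq. (9) is a sum of dot products between weight vectors and feature maps. Below is a minimal sketch of its evaluation, assuming a dictionary-based storage of features and weights; this layout is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of Eq. (9): node terms (weight of the assigned label dotted
# with the node's feature vector) plus edge terms (weight of the assigned label
# pair dotted with the edge's feature vector), summed over the four edge types.

def energy(labels, node_feats, edge_feats, w_node, w_edge, edges):
    e = 0.0
    for i, phi in node_feats.items():                 # object and sub-activity nodes
        e += float(w_node[labels[i]] @ phi)
    for t, edge_list in edges.items():                # the four interaction types
        for (i, j) in edge_list:
            e += float(w_edge[t][(labels[i], labels[j])] @ edge_feats[t][(i, j)])
    return e
```

For a fixed labeling this reduces to a single dot product between the stacked weight vector w and a joint feature map, which is the property the learning formulation below relies on.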
TABLE I
SUMMARY OF THE FEATURES USED IN THE ENERGY FUNCTION.

Description                                                              Count
Object Features                                                             18
  N1. Centroid location                                                      3
  N2. 2D bounding box                                                        4
  N3. Transformation matrix of SIFT matches between adjacent frames          6
  N4. Distance moved by the centroid                                         1
  N5. Displacement of centroid                                               1
Sub-activity Features                                                      103
  N6. Location of each joint (8 joints)                                     24
  N7. Distance moved by each joint (8 joints)                                8
  N8. Displacement of each joint (8 joints)                                  8
  N9. Body pose features                                                    47
  N10. Hand position features                                               16
Object-object Features (computed at start frame, middle frame,
end frame, max and min)                                                     20
  E1. Difference in centroid locations (x, y, z)                             3
  E2. Distance between centroids                                             1
Object-sub-activity Features (computed at start frame, middle frame,
end frame, max and min)                                                     40
  E3. Distance between each joint location and object centroid               8
Object Temporal Features                                                     4
  E4. Total and normalized vertical displacement                             2
  E5. Total and normalized distance between centroids                        2
Sub-activity Temporal Features                                              16
  E6. Total and normalized distance between each corresponding joint
      location (8 joints)                                                   16
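As an example of how the relational features of Table I can be computed, the sketch below evaluates the object-sub-activity features (E3): distances between each tracked joint and the object centroid at the segment's start, middle and end frames, together with the per-joint maximum and minimum over the segment. The array layout is an assumption made for illustration.

```python
import numpy as np

# Sketch of feature E3 from Table I: distances between each of 8 joints and the
# object centroid, sampled at the segment's start, middle and end frames, plus
# the per-joint max and min over the whole segment (5 x 8 = 40 values).
# joints: (T, 8, 3) joint locations per frame; centroids: (T, 3) object centroid.

def object_subactivity_features(joints, centroids):
    dists = np.linalg.norm(joints - centroids[:, None, :], axis=2)  # (T, 8)
    T = dists.shape[0]
    frames = [0, T // 2, T - 1]
    feats = [dists[f] for f in frames] + [dists.max(axis=0), dists.min(axis=0)]
    return np.concatenate(feats)

rng = np.random.default_rng(0)
phi = object_subactivity_features(rng.random((30, 8, 3)), rng.random((30, 3)))
print(phi.shape)   # (40,)
```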
A. Inference

Given the model parameters $w$, inference corresponds to finding the labeling $y$ that maximizes the energy $E_w(x, y)$. Replacing each product $y_i^l y_j^k$ in Eq. (9) by an auxiliary variable $z_{ij}^{lk}$ gives the equivalent mixed-integer objective

$$\max_{y, z} \; \sum_{i \in V_a} \sum_{k \in K_a} y_i^k \left[ w_a^k \cdot \phi_a(i) \right] + \sum_{i \in V_o} \sum_{k \in K_o} y_i^k \left[ w_o^k \cdot \phi_o(i) \right] + \sum_{t \in T} \sum_{(i,j) \in E_t} \sum_{(l,k) \in T_t} z_{ij}^{lk} \left[ w_t^{lk} \cdot \phi_t(i, j) \right] \qquad (12)$$

subject to

$$\forall i, j, l, k: \quad z_{ij}^{lk} \le y_i^l, \quad z_{ij}^{lk} \le y_j^k, \quad y_i^l + y_j^k \le z_{ij}^{lk} + 1, \quad z_{ij}^{lk}, y_i^l \in \{0, 1\}. \qquad (13)$$
Note that the products $y_i^l y_j^k$ have been replaced by auxiliary variables $z_{ij}^{lk}$. Relaxing the variables $z_{ij}^{lk}$ and $y_i^l$ to the interval $[0, 1]$ results in a linear program that can be shown to always have half-integral solutions (i.e., $y_i^l$ only takes values $\{0, 0.5, 1\}$ at the solution) (Hammer et al., 1984). Since every node in our experiments has exactly one class label, we also consider the linear relaxation from above with the additional constraints $\forall i \in V_a: \sum_{l \in K_a} y_i^l = 1$ and $\forall i \in V_o: \sum_{l \in K_o} y_i^l = 1$. This problem can no longer be solved via graph cuts. We compute the exact mixed-integer solution including these additional constraints using a general-purpose MIP solver7 during inference. The MIP solver takes 10.7 seconds on average for one video (a typical video has a graph with 17 sub-activity nodes and 592 object nodes, i.e., 6090 variables).
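A minimal sketch of this inference problem on a toy graph, using the open-source PuLP modeler as a stand-in for the general-purpose MIP solver; the paper does not specify which solver or modeling layer was used, and the two-node instance and its scores below are made up for illustration.

```python
import pulp

# Sketch of the MIP in Eqs. (12)-(13) for a tiny graph: one object node with
# affordance labels Ko, one sub-activity node with labels Ka, and one edge
# between them. Node and edge scores (the w . phi terms) are made-up numbers.
Ko, Ka = ["reachable", "movable"], ["reaching", "moving"]
node_score = {("o", "reachable"): 1.2, ("o", "movable"): 0.3,
              ("a", "reaching"): 0.9, ("a", "moving"): 0.4}
edge_score = {("reachable", "reaching"): 1.0, ("reachable", "moving"): 0.1,
              ("movable", "reaching"): 0.0, ("movable", "moving"): 0.8}

prob = pulp.LpProblem("inference", pulp.LpMaximize)
y = {(i, k): pulp.LpVariable(f"y_{i}_{k}", cat="Binary")
     for i, K in [("o", Ko), ("a", Ka)] for k in K}
z = {(l, k): pulp.LpVariable(f"z_{l}_{k}", cat="Binary") for l in Ko for k in Ka}

prob += (pulp.lpSum(node_score[i, k] * y[i, k] for (i, k) in y) +
         pulp.lpSum(edge_score[l, k] * z[l, k] for (l, k) in z))
for l in Ko:                                   # linearization constraints, Eq. (13)
    for k in Ka:
        prob += z[l, k] <= y["o", l]
        prob += z[l, k] <= y["a", k]
        prob += y["o", l] + y["a", k] <= z[l, k] + 1
prob += pulp.lpSum(y["o", l] for l in Ko) == 1  # exactly one label per node
prob += pulp.lpSum(y["a", k] for k in Ka) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([key for key, var in y.items() if var.value() == 1])
```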
B. Learning

We take a large-margin approach to learning the parameter vector $w$ of Eq. (9) from labeled training examples $(x_1, y_1), \ldots, (x_M, y_M)$ (Taskar et al., 2004; Tsochantaridis et al., 2004). Our method optimizes a regularized upper bound on the training error

$$R(h) = \frac{1}{M} \sum_{m=1}^{M} \Delta(y_m, \hat{y}_m),$$

where the loss function is given by

$$\Delta(y_m, \hat{y}_m) = \sum_{i \in V_o} \sum_{k \in K_o} |y_i^k - \hat{y}_i^k| + \sum_{i \in V_a} \sum_{k \in K_a} |y_i^k - \hat{y}_i^k|.$$
We stack the weight vectors $w_a^k$, $w_o^k$ and $w_t^{lk}$ into $w$, and the terms $y_i^k \phi_a(i)$, $y_i^k \phi_o(i)$ and $z_{ij}^{lk} \phi_t(i, j)$ into a joint feature map $\Psi(x, y)$, where each $z_{ij}^{lk}$ is consistent with Eq. (13) given $y$. Training can then be formulated as the following convex quadratic program (Joachims et al., 2009):
$$\min_{w, \xi} \;\; \frac{1}{2} w^T w + C\xi \qquad (14)$$

$$\text{s.t.} \;\; \forall \hat{y}_1, \ldots, \hat{y}_M \in \{0, 0.5, 1\}^{N \cdot K}: \quad \frac{1}{M} w^T \sum_{m=1}^{M} \left[ \Psi(x_m, y_m) - \Psi(x_m, \hat{y}_m) \right] \ge \frac{1}{M} \sum_{m=1}^{M} \Delta(y_m, \hat{y}_m) - \xi.$$

The exponentially many constraints are handled with the cutting-plane algorithm, which repeatedly adds the currently most violated constraint by solving the loss-augmented inference problem

$$\operatorname*{argmax}_{\hat{y} \in \{0, 0.5, 1\}^{N \cdot K}} \; w^T \Psi(x_m, \hat{y}) + \Delta(y_m, \hat{y}). \qquad (15)$$
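A schematic of the resulting 1-slack cutting-plane training loop (Joachims et al., 2009). The routines loss_aug_inference, joint_feature, loss and solve_qp are placeholders for the loss-augmented MIP of Eq. (15), the joint feature map Ψ, the loss Δ and a standard QP solver; they are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Sketch of 1-slack structural SVM training: repeatedly find the most violated
# constraint with loss-augmented inference (Eq. 15), add it to the working set,
# and re-solve the QP (Eq. 14) until no constraint is violated by more than eps.

def train_ssvm(examples, loss_aug_inference, joint_feature, loss, solve_qp,
               C=1.0, eps=1e-3, max_iter=100):
    dim = joint_feature(*examples[0]).shape[0]
    w, xi, working_set = np.zeros(dim), 0.0, []
    for _ in range(max_iter):
        y_hats = [loss_aug_inference(w, x, y) for x, y in examples]
        dpsi = sum(joint_feature(x, y) - joint_feature(x, yh)
                   for (x, y), yh in zip(examples, y_hats)) / len(examples)
        delta = sum(loss(y, yh) for (_, y), yh in zip(examples, y_hats)) / len(examples)
        if delta - w @ dpsi <= xi + eps:
            break                            # current w satisfies all constraints
        working_set.append((dpsi, delta))    # add the most violated constraint
        w, xi = solve_qp(working_set, C)     # re-solve Eq. (14) on the working set
    return w
```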
C. Multiple Segmentations

Given a set $H$ of temporal segmentation hypotheses, let $y^{h_n}$ denote a labeling of the graph built from the $n$th segmentation $h_n \in H$. We relate the hypothesis labelings to a single output labeling $y$ through the scoring function

$$g_n(y^{h_n}, y) = \sum_{k \in K} \sum_{i \in V} \alpha_{nk} \, y_i^{h_n k} y_i^k, \qquad (16)$$

where $K = K_o \cup K_a$. Here, $\alpha_{nk}$ can be interpreted as the confidence of labeling the segments of label $k$ correctly in the $n$th segmentation hypothesis. We want to find the labeling that maximizes the assignment score across all the segmentations. Therefore, we can write inference in terms of a joint objective function as follows:

$$\hat{y} = \operatorname*{argmax}_{y, \{y^{h_n}\}} \sum_{h_n \in H} \left[ E_w^{h_n}(x^{h_n}, y^{h_n}) + g_n(y^{h_n}, y) \right] \qquad (17)$$

$$\text{s.t.} \;\; \forall k \in K: \; \sum_{n=1}^{|H|} \alpha_{nk} = 1. \qquad (18)$$

Each individual segmentation hypothesis is labeled via

$$\hat{y}^{h_n} = \operatorname*{argmax}_{y^{h_n}} E_w^{h_n}(x^{h_n}, y^{h_n}) + g_n(y^{h_n}, \hat{y}), \qquad (19)$$

and the final labeling is obtained by merging the hypotheses,

$$\hat{y} = \operatorname*{argmax}_{y} \sum_{h_n \in H} g_n(\hat{y}^{h_n}, y). \qquad (20)$$
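A small sketch of the merging step of Eq. (20): each frame receives the label with the highest confidence-weighted agreement across the labeled segmentation hypotheses. The frame-level voting and the given per-hypothesis confidences alpha are simplifying assumptions used for illustration rather than the exact procedure of the paper.

```python
from collections import defaultdict

# Sketch of Eq. (20): combine per-hypothesis frame labelings into one labeling
# by confidence-weighted voting. hyp_labels[n][f] is the label of frame f under
# segmentation hypothesis n; alpha[n][k] is the confidence of label k in
# hypothesis n (assumed given, e.g., estimated on held-out data).

def merge_labelings(hyp_labels, alpha):
    num_frames = len(hyp_labels[0])
    merged = []
    for f in range(num_frames):
        scores = defaultdict(float)
        for n, labels in enumerate(hyp_labels):
            scores[labels[f]] += alpha[n][labels[f]]
        merged.append(max(scores, key=scores.get))
    return merged

hyps = [["reach", "reach", "move", "move"],
        ["reach", "move", "move", "move"]]
alpha = [{"reach": 0.6, "move": 0.5}, {"reach": 0.4, "move": 0.6}]
print(merge_labelings(hyps, alpha))   # ['reach', 'reach', 'move', 'move']
```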
IX. EXPERIMENTS

A. Data
Fig. 5. Example shots of reaching (first row), placing (second row), moving (third row), drinking (fourth row) and eating (fourth row) sub-activities from
our dataset. There are significant variations in the way the subjects perform the sub-activity.
TABLE II
DESCRIPTION OF ACTIVITIES IN TERMS OF SUB-ACTIVITIES. NOTE THAT SOME ACTIVITIES CONSIST OF THE SAME SUB-ACTIVITIES BUT ARE EXECUTED IN A DIFFERENT ORDER.
[Table: rows are the ten high-level activities (Making Cereal, Taking Medicine, Stacking Objects, Unstacking Objects, Microwaving Food, Picking Objects, Cleaning Objects, Taking Food, Arranging Objects, Having a Meal); columns are the sub-activities (reaching, moving, placing, opening, closing, eating, drinking, pouring, scrubbing, null); a check marks the sub-activities occurring in each activity.]
Fig. 6. Tracking Results: Blue dots represent the trajectory of the center of tracked bounding box and red dots represent the trajectory of the center of
ground-truth bounding box. (Best viewed in color.)
40%   49.2   53.5
20%   65.7   69.4
10%   75     77.8
TABLE IV
RESULTS ON CORNELL ACTIVITY DATASET (SUNG ET AL., 2012), TESTED ON New Person DATA FOR 12 ACTIVITY CLASSES.

                      bathroom      bedroom       kitchen     living room      office        Average
                      prec  rec    prec  rec    prec  rec    prec   rec     prec  rec     prec  rec
Sung et al. (2012)    72.7  65.0   76.1  59.2   64.4  47.9   52.6   45.7    73.8  59.8    67.9  55.5
Our method            88.9  61.1   73.0  66.7   96.4  85.4   69.2   68.7    76.7  75.0    80.8  71.4
TABLE V
Results on our CAD-120 dataset, SHOWING AVERAGE MICRO PRECISION/RECALL, AND AVERAGE MACRO PRECISION AND RECALL FOR AFFORDANCE, SUB-ACTIVITIES AND HIGH-LEVEL ACTIVITIES. STANDARD ERROR IS ALSO REPORTED.
[Table V compares the following methods: max class, image only, SVM multiclass, MEMM (Sung et al., 2012), object only, sub-activity only, no temporal interactions, no object interactions, full model with ground-truth segmentation, full model with ground-truth segmentation + tracking, and four further variants of the full model, including the end-to-end system. For each of object affordance, sub-activity and high-level activity labeling, it reports micro precision/recall and macro precision and recall with standard errors. The full end-to-end model achieves 75.8% precision and 74.2% recall for high-level activity labeling. The affordance classes are movable, stationary, reachable, pourable, pourto, containable, drinkable, openable, placeable, closable, scrubbable and scrubber.]

Fig. 7. Confusion matrix for affordance labeling (left), sub-activity labeling (middle) and high-level activity labeling (right) of the test RGB-D videos.
Fig. 8. Descriptive output of our algorithm: Sequence of images from the taking food (Top Row), having meal (Middle Row) and cleaning objects
(Bottom Row) activities labeled with sub-activity and object affordance labels. A single frame is sampled from the temporal segment to represent it.
Fig. 9. Comparison of the sub-activity labeling of various segmentations. This activity involves the sub-activities: reaching, moving, pouring and placing
as colored in red, green, blue and magenta respectively. The x-axis denotes the time axis numbered with frame numbers. It can be seen that the various
individual segmentation labelings are not perfect and make different mistakes, but our method for merging these segmentations selects the correct label for
many frames.
9 http://openrave.org/
10 Our goal in this paper is activity detection; therefore we pre-program the response actions using existing open-source tools in ROS. In the future, one would need to make significant advances in several fields to make this useful in practice, e.g., object detection (Anand et al., 2012), grasping, human-robot interaction, and so on.
Fig. 10. Robot performing the task of assisting humans: (top row) robot clearing the table after detecting the having a meal activity, (middle row) robot fetching a bottle of water after detecting the taking medicine activity and (bottom row) robot putting milk in the fridge after detecting the making cereal activity. The first two columns show the robot observing the activity, the third column shows the robot planning the response in simulation and the last three columns show the robot performing the response action.
TABLE VI
ROBOT OBJECT MANIPULATION RESULTS

task                    # instances    accuracy    accuracy (multi. obs.)
object movement              19           100              100
constrained movement         15            80              100