Activity Recognition

Wanqing Li¹, Zicheng Liu², and Zhengyou Zhang³
¹ University of Wollongong, Wollongong, NSW, Australia
² Microsoft Research, Microsoft Corporation, Redmond, WA, USA
³ Tencent Robotics X and AI Labs, Shenzhen, China

Synonyms

Action recognition

Related Concepts

Activity Detection
Behavior Analysis
Gesture Recognition
Motion Analysis
Motion Capture (MOCAP)

Definition

Activity recognition refers to the process of identifying the types of movement performed by humans over a certain period of time. It is also known as action recognition when the period of time is relatively short.

Background

The classic study on visual analysis of biological motion using moving light display (MLD) [1] has inspired tremendous interest among computer vision researchers in the problem of recognizing human motion through visual information. The commonly used devices to capture human movement include human motion capture (MOCAP) systems with or without markers, multiple video camera systems, and single video camera systems. A MOCAP device usually works in a controlled environment to capture the three-dimensional (3D) joint locations or angles of human bodies; multiple camera systems provide a way to reconstruct 3D body models from multiple viewpoint images. Both MOCAP and multiple camera systems have physical limitations on their use, and single camera systems are probably more practical for many applications. The latter, however, capture the least visual information and, hence, are the most challenging setting for activity recognition. In the past decade, research in activity recognition has mainly focused on single camera systems. The development of commodity depth cameras, such as the Microsoft Kinect sensor, provides another feasible and economical way to capture two-dimensional color information and depth information of human movement simultaneously. In addition, techniques for locating two-dimensional (2D) or three-dimensional (3D) joints of human bodies directly from video frames and depth maps offer practical alternatives to traditional MOCAP devices.
Regardless of which capturing device or which modality, RGB (red-green-blue) video, depth, or skeleton, is used, a useful activity recognition system has to be independent of the anthropometric differences among the individuals who perform the activities, independent of the speed at which the activities are performed, robust against varying acquisition settings and environmental conditions (for instance, different viewpoints and illuminations), scalable to a large number of activities, and capable of recognizing activities in a continuous manner. Since a human body is usually viewed as an articulated system of rigid links or segments connected by joints, human motion can be considered as a continuous evolution of the spatial configuration of the segments, or body posture, and effective representation of the body configuration and its dynamics over time has been central to the research on human activity recognition.

Theory

Let O = {o1, o2, ..., on} be a sequence of observations of the movement of a person over a period of time. The observations can be a sequence of joint angles, a sequence of color images or silhouettes, a sequence of depth maps, or a combination of them. The task of activity recognition is to label O with one of the L classes C = {c1, c2, ..., cL}. Therefore, solutions to the problem of activity recognition are often based on pattern recognition and machine learning approaches, and an activity recognition system usually involves extracting features from the observation sequence O, learning a classifier from training samples, and classifying O using the trained classifier. The spatial and temporal complexity of human activities has led researchers to cast the problem from different perspectives. Early work was mainly based on handcrafted features for recognition, and recent work has been driven greatly by deep learning.
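To make this pipeline concrete, the sketch below maps an observation sequence O to a class label via a fixed-length descriptor and a classifier. It is a minimal illustration, not a method from the cited literature: the mean-and-standard-deviation descriptor and the RBF-kernel SVM are arbitrary illustrative choices, and scikit-learn is assumed to be available.

```python
# Minimal sketch of the descriptor-plus-classifier pipeline described
# above. Each sequence is a (num_frames, feature_dim) NumPy array; the
# descriptor used here is deliberately simple and purely illustrative.
import numpy as np
from sklearn.svm import SVC

def sequence_descriptor(sequence):
    """Collapse a (num_frames, feature_dim) sequence into one vector."""
    return np.concatenate([sequence.mean(axis=0), sequence.std(axis=0)])

def train_recognizer(sequences, labels):
    """Fit an SVM on descriptors extracted from training sequences."""
    X = np.stack([sequence_descriptor(s) for s in sequences])
    return SVC(kernel="rbf").fit(X, labels)

def classify(model, sequence):
    """Assign a new observation sequence to one of the learned classes."""
    return model.predict(sequence_descriptor(sequence)[None, :])[0]
```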
Handcrafted-feature-based methods  The existing techniques based on handcrafted features can be broadly divided into two categories according to whether the dynamics of the activities is implicitly or explicitly modelled. In the first category, the problem of activity recognition is cast from a temporal classification problem to a static classification one by representing activities using descriptors. A descriptor is extracted from the observation sequence O and is intended to capture both the spatial and the temporal information of the activity, hence modelling the dynamics of the activity implicitly. Activity recognition is then achieved by a conventional classifier such as a support vector machine (SVM) or K-nearest neighbor (KNN) classifier. There are three commonly used approaches to extracting activity descriptors. The first approach builds upon motion energy images (MEI) [2], binary images representing the presence of motion in regions, motion history images (MHI) [2], static images representing the location and progression path of motion, and their variants.
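The following sketch illustrates the temporal-template idea behind MEI and MHI [2]: frame differencing yields a binary motion mask, new motion is stamped with a duration value that decays over time (the MHI), and the MEI is simply the region where any motion has occurred. The threshold and duration values are illustrative, not settings from [2].

```python
# Sketch of motion history/energy images from a grayscale clip.
import numpy as np

def motion_templates(frames, tau=30, diff_thresh=25):
    """frames: (T, H, W) uint8 grayscale array -> (MHI, MEI)."""
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for t in range(1, len(frames)):
        # binary motion mask from simple frame differencing
        moving = np.abs(frames[t].astype(np.int16)
                        - frames[t - 1].astype(np.int16)) > diff_thresh
        # stamp new motion with tau; let old motion decay toward zero
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    mei = mhi > 0  # binary motion energy image
    return mhi, mei
```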
The second approach considers a sequence of silhouettes as a spatiotemporal volume, and an activity descriptor is computed from the volume. Typical examples are differential geometric surface properties [3], space-time saliency, action dynamics, and shape structure and orientation [4]. The third approach describes an activity using a set of spatiotemporal interest points (STIPs). The general concept is first to detect STIPs [5] in the observations O, which are usually a video sequence. Features are then extracted from a local volume around each STIP, and a descriptor can be formed either by simply putting them together into a bag of features or by quantizing the STIP features against a vocabulary (i.e., a bag of visual words) and calculating the histogram of occurrences of the vocabulary words within the observation sequence O. In addition to SVM and KNN, latent topic models such as the probabilistic latent semantic analysis (pLSA) model and latent Dirichlet allocation (LDA) were used in [6].
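A sketch of the bag-of-visual-words step is given below: local descriptors extracted around detected STIPs (the detection and description steps of [5] are assumed to be done elsewhere) are clustered into a vocabulary, and each video is represented by a normalized histogram of word occurrences. The vocabulary size is an illustrative choice.

```python
# Sketch of building a visual vocabulary and a per-video word histogram.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, num_words=200):
    """all_descriptors: (N, D) stack of local STIP descriptors
    pooled from the training videos."""
    return KMeans(n_clusters=num_words, n_init=10).fit(all_descriptors)

def bag_of_words(vocabulary, descriptors):
    """Normalized histogram of visual-word occurrences for one video."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist / max(hist.sum(), 1)  # robust to varying video length
```

The resulting histograms can then be fed to the SVM, KNN, or latent topic models mentioned above.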
In the second category, the proposed methods usually follow the concept that an activity is a temporal evolution of the spatial configuration of the body parts and, hence, emphasize the dynamics of the activities more than the methods in the first category. They usually extract a sequence of feature vectors, each feature vector being extracted from a frame, or a small neighborhood, of the observation sequence O. The two commonly used approaches are temporal templates and graphical models. The temporal-template-based approach typically represents the dynamics directly through exemplar sequences and adopts dynamic time warping (DTW) to compare an input sequence with the exemplar sequences, for instance, the work in [7].
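The core of the exemplar comparison is the DTW distance itself; a textbook implementation is sketched below, with classification by nearest exemplar. Each sequence is a (length, feature_dim) array, and a smaller distance means greater similarity.

```python
# Textbook dynamic time warping distance between two feature sequences.
import numpy as np

def dtw_distance(seq_a, seq_b):
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # extend the cheapest of the three admissible alignments
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

def nearest_exemplar_label(sequence, exemplars):
    """exemplars: list of (label, sequence) pairs."""
    return min(exemplars, key=lambda e: dtw_distance(sequence, e[1]))[0]
```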
The graphical-model-based approach includes both generative and discriminative models. The most prominent generative model is the hidden Markov model (HMM), in which sequences of observed features are grouped into similar configurations, i.e., states, and both the probability distribution of the observations at each state and the temporal transition functions between these states are learned from training samples [8]. A more general generative graphical model, referred to as an action graph, was established in [9], in which an activity is encoded by one or multiple paths in the action graph. Due to its sharing mechanism, the action graph can be trained, and also easily expanded to new actions, with a small number of training samples. The generative graphical models often rely on an assumption of statistical independence of the observations to compute the joint probability of the states and the observations. This makes it hard to model the long-term contextual dependencies that are important to the recognition of activities over a long period of time. Discriminative models, such as conditional random fields (CRF) [10], offer an effective way to model long-term dependencies and compute the conditional probability that maps the observations to the motion class labels.
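As a concrete illustration of the generative route, the sketch below fits one HMM per activity class and labels a test sequence by the model with the highest log-likelihood. It assumes the third-party hmmlearn package; the Gaussian emission model and the number of hidden states are illustrative choices rather than settings from [8].

```python
# Sketch of per-class HMM training and maximum-likelihood classification.
import numpy as np
from hmmlearn import hmm

def train_class_models(sequences_by_class, num_states=5):
    """sequences_by_class: {label: [(T_i, D) arrays]} -> {label: HMM}."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.concatenate(seqs)          # hmmlearn expects stacked frames
        lengths = [len(s) for s in seqs]  # plus per-sequence lengths
        m = hmm.GaussianHMM(n_components=num_states, n_iter=50)
        models[label] = m.fit(X, lengths)
    return models

def classify_sequence(models, sequence):
    # score() returns the log-likelihood of the observation sequence
    return max(models, key=lambda c: models[c].score(sequence))
```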
Deep-learning-based methods  There are four commonly used approaches to leveraging the capability of deep neural networks for action recognition from RGB video or depth sequences [11]. The first approach is to employ convolutional neural networks (CNNs) to extract features from single frames or from a stack of frames sampled from a video sequence representing an action instance; the widely used two-stream architecture [12] is a typical example. The second approach is to apply three-dimensional (3D) convolution to a segment of frames and then temporal pooling to form video-based features, such as C3D [13], or to simply apply 3D convolution to an entire action instance [14]. The third approach is to first extract frame-based features using a CNN and then model the temporal dynamics using a recurrent neural network (RNN), such as ConvLSTM [15]; typically, video-based features are obtained from the last time step of the RNN. The fourth approach is to generate dynamic images [16] through rank pooling and to apply a CNN to extract features from the dynamic images. Each approach has its own strengths and weaknesses. For instance, the first approach tends to focus on short-term rather than long-term temporal information. The second approach, if implemented with temporal pooling, is also weak at capturing long-term temporal information; this weakness can be mitigated to some extent by applying 3D convolution to the entire action instance. In the third approach, the extraction of spatial information (e.g., using a CNN) is separated from the temporal modeling (e.g., using an RNN); it tends to model long-term dynamics more effectively than short-term dynamics. The dynamic image approach depends on how effectively both spatial and temporal information can be encoded into one or multiple images, and the commonly used rank pooling method for generating dynamic images tends to suppress small motions. Similar approaches have been developed to extract deep features from skeleton sequences [11]. For instance, Yan et al. [17] proposed spatial-temporal graph convolutional networks (ST-GCN). In [18], the spatial and temporal information in a skeleton sequence is encoded into multiple texture images, and CNNs are used to extract features. In [19], a skeleton sequence is fed into an RNN directly, and features are extracted from the last time step of the RNN. Compared with the traditional RNN, the recently proposed independently recurrent neural network (IndRNN) [20] is promising and achieves state-of-the-art results.
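The sketch below illustrates the third approach in PyTorch: a small CNN extracts per-frame features, an LSTM models their temporal evolution, and the video-level prediction is read from the last time step. The layer sizes are illustrative and do not reproduce any published architecture such as [15].

```python
# Sketch of the CNN-then-RNN design for video classification.
import torch
import torch.nn as nn

class CNNRNNRecognizer(nn.Module):
    def __init__(self, num_classes, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(  # per-frame spatial feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, video):                   # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))   # fold time into batch
        feats = feats.view(b, t, -1)            # back to (B, T, feat_dim)
        out, _ = self.rnn(feats)                # temporal modeling
        return self.head(out[:, -1])            # logits from last step
```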
In many realistic applications, an activity may occupy only a small portion of the entire space-time volume of a video sequence.
In such situations, it does not make sense to classify the entire video. Instead, one needs to locate the activity in space and time. This is commonly known as an activity detection or action detection problem. In addition, continuous recognition of activities under realistic conditions, such as viewpoint invariance and a large number of activities, remains challenging, though extensive effort has been devoted and progress has been made in activity recognition research over the past decade.
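A minimal route from recognition to temporal detection is a sliding window, sketched below: any trained recognizer scores fixed-length windows, and windows whose best class score clears a threshold are reported. Practical detectors refine this with multiple window scales and non-maximum suppression; the window length, stride, threshold, and scoring function here are all assumed for illustration.

```python
# Sketch of sliding-window temporal action detection.
import numpy as np

def detect_activities(frames, score_window, win_len=64, stride=16,
                      threshold=0.5):
    """frames: (T, ...) array; score_window: window -> per-class scores."""
    detections = []
    for start in range(0, max(len(frames) - win_len, 0) + 1, stride):
        scores = score_window(frames[start:start + win_len])
        label = int(np.argmax(scores))
        if scores[label] >= threshold:
            detections.append((start, start + win_len, label,
                               float(scores[label])))
    return detections  # (start_frame, end_frame, class, confidence) tuples
```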
Application

Activity recognition has many potential applications. It is one of the key enabling technologies in security and surveillance for automatic monitoring of human activities in a public space and of the activities of daily living of elderly people at home. It is also essential for autonomous retail shops and autonomous driving. Robust understanding and interpretation of human activities also allows humans to interact with machines in a natural way. Proper modeling of the spatial configuration and dynamics of human motion would enable realistic synthesis of human motion for the gaming and movie industries and help train humanoid robots in a flexible and economical way. In sports, activity recognition technology has also been used in training and in sports video retrieval.
References

1. Johansson G (1973) Visual perception of biological motion and a model for its analysis. Percept Psychophys 14(2):201–211
2. Bobick A, Davis J (2001) The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Mach Intell 23(3):257–267
3. Yilmaz A, Shah M (2008) A differential geometric approach to representing the human actions. Comput Vis Image Underst 109(3):335–351
4. Gorelick L, Blank M, Shechtman E, Irani M, Basri R (2007) Actions as space-time shapes. IEEE Trans Pattern Anal Mach Intell 29(12):2247–2253
5. Laptev I, Lindeberg T (2003) Space-time interest points. In: International conference on computer vision, pp 432–439
6. Niebles JC, Wang H, Fei-Fei L (2008) Unsupervised learning of human action categories using spatial-temporal words. Int J Comput Vis 79(3):299–318
7. Wang L, Suter D (2007) Learning and matching of dynamic shape manifolds for human action recognition. IEEE Trans Image Process 16:1646–1661
8. Oliver N, Garg A, Horvitz E (2004) Layered representations for learning and inferring office activity from multiple sensory channels. Comput Vis Image Underst 96:163–180
9. Li W, Zhang Z, Liu Z (2008) Expandable data-driven graphical modeling of human actions based on salient postures. IEEE Trans Circuits Syst Video Technol 18(11):1499–1510
10. Wang Y, Mori G (2011) Hidden part models for human action recognition: probabilistic versus max margin. IEEE Trans Pattern Anal Mach Intell 33(7):1310–1323
11. Wang P, Li W, Ogunbona P, Wan J, Escalera S (2018) RGB-D-based human motion recognition with deep learning: a survey. Comput Vis Image Underst 171:118–139
12. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576
13. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497
14. Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231
15. Shi X, Chen Z, Wang H, Yeung D-Y, Wong W-K, Woo W (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in neural information processing systems, pp 802–810
16. Bilen H, Fernando B, Gavves E, Vedaldi A, Gould S (2016) Dynamic image networks for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3034–3042
17. Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI conference on artificial intelligence
18. Li C, Hou Y, Wang P, Li W (2017) Joint distance maps based action recognition with convolutional neural networks. IEEE Signal Process Lett 24:624–628
19. Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal LSTM with trust gates for 3D human action recognition. In: Proceedings of the European conference on computer vision, pp 816–833
20. Li S, Li W, Cook C, Zhu C, Gao Y (2018) Independently recurrent neural network (IndRNN): building a longer and deeper RNN. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5457–5466