Activity Recognition

Wanqing Li¹, Zicheng Liu², and Zhengyou Zhang³
¹ University of Wollongong, Wollongong, NSW, Australia
² Microsoft Research, Microsoft Corporation, Redmond, WA, USA
³ Tencent Robotics X and AI Labs, Shenzhen, China

Synonyms

Action recognition

Related Concepts

Activity Detection
Behavior Analysis
Gesture Recognition
Motion Analysis
Motion Capture (MOCAP)

Definition

Activity recognition refers to the process of identifying the types of movement performed by humans over a certain period of time. It is also known as action recognition when the period of time is relatively short.

Background

The classic study on visual analysis of biological motion using moving light display (MLD) [1] has inspired tremendous interest among computer vision researchers in the problem of recognizing human motion through visual information. The commonly used devices to capture human movement include human motion capture (MOCAP) systems with or without markers, multiple video camera systems, and single video camera systems. A MOCAP device usually works in a controlled environment to capture the three-dimensional (3D) joint locations or angles of human bodies; multiple camera systems provide a way to reconstruct 3D body models from multiple viewpoint images. Both MOCAP and multiple camera systems have physical limitations on their use, and single camera systems are probably more practical for many applications. The latter, however, capture the least visual information and, hence, are the most challenging setting for activity recognition. In the past decade, research in activity recognition has mainly focused on single camera systems. The development of commodity depth cameras, such as the Microsoft Kinect sensor, provides another feasible and economical way to capture two-dimensional color information and depth information of human movement simultaneously. In addition, techniques for locating two-dimensional (2D) or three-dimensional (3D) joints of human bodies directly from video frames and depth maps offer practical alternatives to traditional MOCAP devices.
Regardless of which capturing device or which modality, RGB (red-green-blue) video, depth, or skeleton, is used, a useful activity recognition system has to be independent of the anthropometric differences among the individuals who perform the activities, independent of the speed at which the activities are performed, robust against varying acquisition settings and environmental conditions (for instance, different viewpoints and illuminations), scalable to a large number of activities, and capable of recognizing activities in a continuous manner. Since a human body is usually viewed as an articulated system of rigid links or segments connected by joints, human motion can be considered as a continuous evolution of the spatial configuration of the segments, or body posture, and effective representation of the body configuration and its dynamics over time has been central to the research on human activity recognition.

Theory

Let O = {o1, o2, ..., on} be a sequence of observations of the movement of a person over a period of time. The observations can be a sequence of joint angles, a sequence of color images or silhouettes, a sequence of depth maps, or a combination of them. The task of activity recognition is to label O with one of the L classes C = {c1, c2, ..., cL}. Therefore, solutions to the problem of activity recognition are often based on pattern recognition and machine learning approaches, and an activity recognition system usually involves extracting features from the observation sequence O, learning a classifier from training samples, and classifying O using the trained classifier. The spatial and temporal complexity of human activities has led researchers to cast the problem from different perspectives. Early work was mainly based on handcrafted features for recognition, and recent work has been driven greatly by deep learning.
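To make this pipeline concrete, the sketch below maps an observation sequence O to a class label via a fixed-length descriptor and a classifier. It is a minimal illustration, not a method from the cited literature: the mean-and-standard-deviation descriptor and the RBF-kernel SVM are arbitrary illustrative choices, and scikit-learn is assumed to be available.

```python
# Minimal sketch of the descriptor-plus-classifier pipeline described
# above. Each sequence is a (num_frames, feature_dim) NumPy array; the
# descriptor used here is deliberately simple and purely illustrative.
import numpy as np
from sklearn.svm import SVC

def sequence_descriptor(sequence):
    """Collapse a (num_frames, feature_dim) sequence into one vector."""
    return np.concatenate([sequence.mean(axis=0), sequence.std(axis=0)])

def train_recognizer(sequences, labels):
    """Fit an SVM on descriptors extracted from training sequences."""
    X = np.stack([sequence_descriptor(s) for s in sequences])
    return SVC(kernel="rbf").fit(X, labels)

def classify(model, sequence):
    """Assign a new observation sequence to one of the learned classes."""
    return model.predict(sequence_descriptor(sequence)[None, :])[0]
```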
Handcrafted-feature-based methods  The existing techniques based on handcrafted features can be broadly divided into two categories according to whether the dynamics of the activities is implicitly or explicitly modelled. In the first category, the problem of activity recognition is cast from a temporal classification problem to a static classification one by representing activities using descriptors. A descriptor is extracted from the observation sequence O and is intended to capture both the spatial and the temporal information of the activity, hence modelling the dynamics of the activity implicitly. Activity recognition is then achieved by a conventional classifier such as a support vector machine (SVM) or K-nearest neighbor (KNN) classifier. There are three commonly used approaches to extracting activity descriptors. The first approach builds upon motion energy images (MEI) [2], binary images representing the presence of motion in regions, motion history images (MHI) [2], static images representing the location and progression path of motion, and their variants.
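The following sketch illustrates the temporal-template idea behind MEI and MHI [2]: frame differencing yields a binary motion mask, new motion is stamped with a duration value that decays over time (the MHI), and the MEI is simply the region where any motion has occurred. The threshold and duration values are illustrative, not settings from [2].

```python
# Sketch of motion history/energy images from a grayscale clip.
import numpy as np

def motion_templates(frames, tau=30, diff_thresh=25):
    """frames: (T, H, W) uint8 grayscale array -> (MHI, MEI)."""
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for t in range(1, len(frames)):
        # binary motion mask from simple frame differencing
        moving = np.abs(frames[t].astype(np.int16)
                        - frames[t - 1].astype(np.int16)) > diff_thresh
        # stamp new motion with tau; let old motion decay toward zero
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    mei = mhi > 0  # binary motion energy image
    return mhi, mei
```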
The second approach considers a sequence of silhouettes as a spatiotemporal volume, and an activity descriptor is computed from the volume. Typical examples are differential geometric surface properties [3], space-time saliency, action dynamics, and shape structure and orientation [4]. The third approach describes an activity using a set of spatiotemporal interest points (STIPs). The general concept is first to detect STIPs [5] in the observations O, which are usually a video sequence. Features are then extracted from a local volume around each STIP, and a descriptor can be formed either by simply putting them together into a bag of features or by quantizing the STIP features against a vocabulary (i.e., a bag of visual words) and calculating the histogram of occurrences of the vocabulary words within the observation sequence O. In addition to SVM and KNN, latent topic models such as the probabilistic latent semantic analysis (pLSA) model and latent Dirichlet allocation (LDA) were used in [6].
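A sketch of the bag-of-visual-words step is given below: local descriptors extracted around detected STIPs (the detection and description steps of [5] are assumed to be done elsewhere) are clustered into a vocabulary, and each video is represented by a normalized histogram of word occurrences. The vocabulary size is an illustrative choice.

```python
# Sketch of building a visual vocabulary and a per-video word histogram.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, num_words=200):
    """all_descriptors: (N, D) stack of local STIP descriptors
    pooled from the training videos."""
    return KMeans(n_clusters=num_words, n_init=10).fit(all_descriptors)

def bag_of_words(vocabulary, descriptors):
    """Normalized histogram of visual-word occurrences for one video."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist / max(hist.sum(), 1)  # robust to varying video length
```

The resulting histograms can then be fed to the SVM, KNN, or latent topic models mentioned above.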
In the second category, the proposed methods usually follow the concept that an activity is a temporal evolution of the spatial configuration of the body parts and, hence, emphasize the dynamics of the activities more than the methods in the first category. They usually extract a sequence of feature vectors, each feature vector being extracted from a frame, or a small neighborhood, of the observation sequence O. The two commonly used approaches are temporal templates and graphical models. The temporal-template-based approach typically represents the dynamics directly through exemplar sequences and adopts dynamic time warping (DTW) to compare an input sequence with the exemplar sequences, for instance, the work in [7].
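The core of the exemplar comparison is the DTW distance itself; a textbook implementation is sketched below, with classification by nearest exemplar. Each sequence is a (length, feature_dim) array, and a smaller distance means greater similarity.

```python
# Textbook dynamic time warping distance between two feature sequences.
import numpy as np

def dtw_distance(seq_a, seq_b):
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # extend the cheapest of the three admissible alignments
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

def nearest_exemplar_label(sequence, exemplars):
    """exemplars: list of (label, sequence) pairs."""
    return min(exemplars, key=lambda e: dtw_distance(sequence, e[1]))[0]
```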
The graphical-model-based approach includes both generative and discriminative models. The most prominent generative model is the hidden Markov model (HMM), in which sequences of observed features are grouped into similar configurations, i.e., states, and both the probability distribution of the observations at each state and the temporal transition functions between these states are learned from training samples [8]. A more general generative graphical model, referred to as an action graph, was established in [9], in which an activity is encoded by one or multiple paths in the action graph. Due to its sharing mechanism, the action graph can be trained, and also easily expanded to new actions, with a small number of training samples. The generative graphical models often rely on an assumption of statistical independence of the observations to compute the joint probability of the states and the observations. This makes it hard to model the long-term contextual dependencies that are important to the recognition of activities over a long period of time. Discriminative models, such as conditional random fields (CRF) [10], offer an effective way to model long-term dependencies and compute the conditional probability that maps the observations to the motion class labels.
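As a concrete illustration of the generative route, the sketch below fits one HMM per activity class and labels a test sequence by the model with the highest log-likelihood. It assumes the third-party hmmlearn package; the Gaussian emission model and the number of hidden states are illustrative choices rather than settings from [8].

```python
# Sketch of per-class HMM training and maximum-likelihood classification.
import numpy as np
from hmmlearn import hmm

def train_class_models(sequences_by_class, num_states=5):
    """sequences_by_class: {label: [(T_i, D) arrays]} -> {label: HMM}."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.concatenate(seqs)          # hmmlearn expects stacked frames
        lengths = [len(s) for s in seqs]  # plus per-sequence lengths
        m = hmm.GaussianHMM(n_components=num_states, n_iter=50)
        models[label] = m.fit(X, lengths)
    return models

def classify_sequence(models, sequence):
    # score() returns the log-likelihood of the observation sequence
    return max(models, key=lambda c: models[c].score(sequence))
```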
Deep-learning-based methods  There are four commonly used approaches to leveraging the capability of deep neural networks for action recognition from RGB video or depth sequences [11]. The first approach is to employ convolutional neural networks (CNNs) to extract features from single frames or from a stack of frames sampled from a video sequence representing an action instance; the widely used two-stream architecture [12] is a typical example. The second approach is to apply three-dimensional (3D) convolution to a segment of frames and then temporal pooling to form video-based features, such as C3D [13], or to simply apply 3D convolution to an entire action instance [14]. The third approach is to first extract frame-based features using a CNN and then model the temporal dynamics using a recurrent neural network (RNN), such as ConvLSTM [15]; typically, video-based features are obtained from the last time step of the RNN. The fourth approach is to generate dynamic images [16] through rank pooling and to apply a CNN to extract features from the dynamic images. Each approach has its own strengths and weaknesses. For instance, the first approach tends to focus on short-term rather than long-term temporal information. The second approach, if implemented with temporal pooling, is also weak at capturing long-term temporal information; this weakness can be mitigated to some extent by applying 3D convolution to the entire action instance. In the third approach, the extraction of spatial information (e.g., using a CNN) is separated from the temporal modeling (e.g., using an RNN); it tends to model long-term dynamics more effectively than short-term dynamics. The dynamic image approach depends on how effectively both spatial and temporal information can be encoded into one or multiple images, and the commonly used rank pooling method for generating dynamic images tends to suppress small motions. Similar approaches have been developed to extract deep features from skeleton sequences [11]. For instance, Yan et al. [17] proposed spatial-temporal graph convolutional networks (ST-GCN). In [18], the spatial and temporal information in a skeleton sequence is encoded into multiple texture images, and CNNs are used to extract features. In [19], a skeleton sequence is fed into an RNN directly, and features are extracted from the last time step of the RNN. Compared with the traditional RNN, the recently proposed independently recurrent neural network (IndRNN) [20] is promising and achieves state-of-the-art results.
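The sketch below illustrates the third approach in PyTorch: a small CNN extracts per-frame features, an LSTM models their temporal evolution, and the video-level prediction is read from the last time step. The layer sizes are illustrative and do not reproduce any published architecture such as [15].

```python
# Sketch of the CNN-then-RNN design for video classification.
import torch
import torch.nn as nn

class CNNRNNRecognizer(nn.Module):
    def __init__(self, num_classes, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(  # per-frame spatial feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, video):                   # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))   # fold time into batch
        feats = feats.view(b, t, -1)            # back to (B, T, feat_dim)
        out, _ = self.rnn(feats)                # temporal modeling
        return self.head(out[:, -1])            # logits from last step
```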
In many realistic applications, an activity may occupy only a small portion of the entire space-time volume of a video sequence.
In such situations, it does not make sense to classify the entire video. Instead, one needs to locate the activity in space and time. This is commonly known as an activity detection or action detection problem. In addition, continuous recognition of activities under realistic conditions, such as viewpoint invariance and a large number of activities, remains challenging, though extensive effort has been devoted and progress has been made in activity recognition research over the past decade.
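A minimal route from recognition to temporal detection is a sliding window, sketched below: any trained recognizer scores fixed-length windows, and windows whose best class score clears a threshold are reported. Practical detectors refine this with multiple window scales and non-maximum suppression; the window length, stride, threshold, and scoring function here are all assumed for illustration.

```python
# Sketch of sliding-window temporal action detection.
import numpy as np

def detect_activities(frames, score_window, win_len=64, stride=16,
                      threshold=0.5):
    """frames: (T, ...) array; score_window: window -> per-class scores."""
    detections = []
    for start in range(0, max(len(frames) - win_len, 0) + 1, stride):
        scores = score_window(frames[start:start + win_len])
        label = int(np.argmax(scores))
        if scores[label] >= threshold:
            detections.append((start, start + win_len, label,
                               float(scores[label])))
    return detections  # (start_frame, end_frame, class, confidence) tuples
```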
Application

Activity recognition has many potential applications. It is one of the key enabling technologies in security and surveillance for automatic monitoring of human activities in a public space and of the activities of daily living of elderly people at home. It is also essential for autonomous retail shops and autonomous driving. Robust understanding and interpretation of human activities also allows humans to interact with machines in a natural way. Proper modeling of the spatial configuration and dynamics of human motion would enable realistic synthesis of human motion for the gaming and movie industries and help train humanoid robots in a flexible and economical way. In sports, activity recognition technology has also been used in training and in sports video retrieval.
References

1. Johansson G (1973) Visual perception of biological motion and a model for its analysis. Percept Psychophys 14(2):201–211
2. Bobick A, Davis J (2001) The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Mach Intell 23(3):257–267
3. Yilmaz A, Shah M (2008) A differential geometric approach to representing the human actions. Comput Vis Image Underst 109(3):335–351
4. Gorelick L, Blank M, Shechtman E, Irani M, Basri R (2007) Actions as space-time shapes. IEEE Trans Pattern Anal Mach Intell 29(12):2247–2253
5. Laptev I, Lindeberg T (2003) Space-time interest points. In: International conference on computer vision, pp 432–439
6. Niebles JC, Wang H, Fei-Fei L (2008) Unsupervised learning of human action categories using spatial-temporal words. Int J Comput Vis 79(3):299–318
7. Wang L, Suter D (2007) Learning and matching of dynamic shape manifolds for human action recognition. IEEE Trans Image Process 16:1646–1661
8. Oliver N, Garg A, Horvitz E (2004) Layered representations for learning and inferring office activity from multiple sensory channels. Comput Vis Image Underst 96:163–180
9. Li W, Zhang Z, Liu Z (2008) Expandable data-driven graphical modeling of human actions based on salient postures. IEEE Trans Circuits Syst Video Technol 18(11):1499–1510
10. Wang Y, Mori G (2011) Hidden part models for human action recognition: probabilistic versus max margin. IEEE Trans Pattern Anal Mach Intell 33(7):1310–1323
11. Wang P, Li W, Ogunbona P, Wan J, Escalera S (2018) RGB-D-based human motion recognition with deep learning: a survey. Comput Vis Image Underst 171:118–139
12. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576
13. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497
14. Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231
15. Shi X, Chen Z, Wang H, Yeung D-Y, Wong W-K, Woo W (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in neural information processing systems, pp 802–810
16. Bilen H, Fernando B, Gavves E, Vedaldi A, Gould S (2016) Dynamic image networks for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3034–3042
17. Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI conference on artificial intelligence
18. Li C, Hou Y, Wang P, Li W (2017) Joint distance maps based action recognition with convolutional neural networks. IEEE Signal Process Lett 24:624–628
19. Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal LSTM with trust gates for 3D human action recognition. In: Proceedings of the European conference on computer vision, pp 816–833
20. Li S, Li W, Cook C, Zhu C, Gao Y (2018) Independently recurrent neural network (IndRNN): building a longer and deeper RNN. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5457–5466